<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/TRAIN_AWS_P1_Lecture_2_%5BSTUDENTS%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lecture #2: Exploratory Data Analysis (EDA)**
---
### **Description:**  Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves exploring and understanding the data to gain insights, identify patterns, and detect anomalies. EDA allows us to understand the underlying structure of the data, test assumptions, and prepare the data for further analysis. This lecture notebook will provide practice utilizing basic Pandas commands and exploring rows and columns as part of the EDA process.
<br>


### **Lecture Notebook Structure**

**Part 1**: [Working with rows and columns on the Linnerud dataset](#p1)

**Part 2**: [Exploring the mean, median and sum](#p2)

**Part 3**: [Reviewing EDA](#p3)



<br>


**Goals:** By the end of this lecture, you will:
* Be able to use basic Pandas commands.
* Be able to explore basic information about datasets.
* Know how to explore columns, rows, values, outliers, and more.
* Be able to remove missing or incorrect values.
* Practice data cleaning on real-world datasets.

<br>

### **Cheat Sheets**
[EDA cheatsheet](https://drive.google.com/file/d/1ZZnIzgcT8dYcGwWVAR9DDFIwGXTGbIiU/view?usp=sharing)

[Pandas cheatsheet](https://docs.google.com/document/d/1v-MZCgoZJGRcK-69OOu5fYhm58x2G0JUWyi2H53j8Ls/edit)

<a name="p1"></a>
## **Part 1: Working with rows and columns on the Linnerud dataset**

---

The Linnerud dataset is a small but useful dataset that records three exercises performed by 20 middle-aged men at a fitness center. The three exercise variables included in the dataset are `Chins`, `Situps`, and `Jumps`.

- `Chins`: number of chin-ups performed by each participant
- `Situps`: number of sit-ups performed by each participant
- `Jumps`: number of jumping jacks performed by each participant

By analyzing the data, we can see how performance varies across participants and compare the typical values for each exercise in this population.

<br>

**Run the following code before answering questions.**

In [None]:
# Import pandas
import pandas as pd

# Import datasets submodule
from sklearn import datasets

# Load dataset (actual data with associated documentation)
linnerud = datasets.load_linnerud()

# Create dataframe
linnerud_df = pd.DataFrame(data=linnerud.data, columns=linnerud.feature_names)
linnerud_df

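|  | Chins | Situps | Jumps |
|---|---|---|---|
| 0 | 5.0 | 162.0 | 60.0 |
| 1 | 2.0 | 110.0 | 60.0 |
| 2 | 12.0 | 101.0 | 101.0 |
| 3 | 12.0 | 105.0 | 37.0 |
| 4 | 13.0 | 155.0 | 58.0 |
| 5 | 4.0 | 101.0 | 42.0 |
| 6 | 8.0 | 101.0 | 38.0 |
| 7 | 6.0 | 125.0 | 40.0 |
| 8 | 15.0 | 200.0 | 40.0 |
| 9 | 17.0 | 251.0 | 250.0 |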


### **Exercise #1:** What is the 16th value for `Chins`?

There are numerous ways to select and display a specific value. Remember, **Python uses 0-based indexing**, so the 16th row has an index of 15. You can either select the `Chins` column and then take its 16th value, *or* you can select the 16th row and 1st column by position.
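As a minimal sketch of both approaches, the example below uses a small made-up dataframe (`toy_df`, with hypothetical `Pushups` and `Squats` columns) rather than the Linnerud data, so the exercise itself stays open. It assumes the default integer index that pandas assigns.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the Linnerud dataset)
toy_df = pd.DataFrame({
    "Pushups": [10, 20, 30, 40, 50],
    "Squats": [15, 25, 35, 45, 55],
})

# Approach 1: select the column, then take the 3rd value (position 2)
third_pushups = toy_df["Pushups"].iloc[2]

# Approach 2: select the 3rd row and 1st column by position
third_pushups_by_position = toy_df.iloc[2, 0]

# Positional slices exclude the end: positions 1, 2, 3 are the 2nd-4th rows
rows_2_to_4 = toy_df.iloc[1:4]

print(third_pushups, third_pushups_by_position)
print(rows_2_to_4)
```

Both approaches return the same value here. The slice at the end shows that `iloc[1:4]` stops before position 4, which matters when selecting a range of rows.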

In [None]:
# Print the 16th value for Chins


### **Exercise #2:** What is the 9th value for `Situps`?

In [None]:
# Print the 9th value for Situps


### **Exercise #3:** Print the entire fourth row.

In [None]:
# Print the entire fourth row


### **Exercise #4:** Print the `Jumps` column.

In [None]:
# Print the Jumps column


### **Exercise #5:** Print the 12th-16th rows.

In [None]:
# Print the 12th-16th rows


<a name="p2"></a>
## **Part 2: Exploring the mean, median and sum**
---
We will continue with the Linnerud dataset, this time exploring the mean, median, and sum of each variable.

These statistics describe the overall distribution of the data, which helps researchers understand the dataset and make data-driven decisions.

### **Exercise #1:** Find the mean, median, and sum for `Jumps`.

The mean and median can give an idea of the typical or central value of a variable, while the sum can indicate the total amount of a variable across all observations. These statistics can be used to summarize the data for a report or visualize the data in a chart, as well as for inferential purposes (e.g., testing hypotheses or making predictions).
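To make the three statistics concrete, here is a minimal sketch on a made-up Series (`reps` is hypothetical and not part of the Linnerud data). It uses the pandas `mean()`, `median()`, and `sum()` methods.

```python
import pandas as pd

# Hypothetical values for illustration only (not the Linnerud dataset)
reps = pd.Series([10, 20, 30, 40, 100], name="Reps")

reps_mean = reps.mean()      # arithmetic average of the values
reps_median = reps.median()  # middle value once the data is sorted
reps_sum = reps.sum()        # total across all observations

print(reps_mean, reps_median, reps_sum)
```

In this toy example the single large value (100) pulls the mean up to 40 while the median stays at 30, which is one reason comparing the two can hint at outliers.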

In [None]:
# Mean, median and sum for Jumps


---
#### **Try Exercise #2 on your own!**
---

### **Exercise #2:** Find the mean, median, and sum for `Situps`.

In [None]:
# Mean, median and sum for Situps


<a name="p3"></a>

## **Part 3: Reviewing EDA**
---
This dataset contains information on five students and their test scores in three subjects: math, English, and science. 

**Run the cell to define the data and create the dataframe.**

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Math Score': [75, 80, 90, 70, 85],
    'English Score': [80, 70, 85, 90, 75],
    'Science Score': [90, 85, 80, 75, 95],
}

student_df = pd.DataFrame(data)
student_df

|  | Name | Math Score | English Score | Science Score |
|---|---|---|---|---|
| 0 | Alice | 75 | 80 | 90 |
| 1 | Bob | 80 | 70 | 85 |
| 2 | Charlie | 90 | 85 | 80 |
| 3 | David | 70 | 90 | 75 |
| 4 | Emma | 85 | 75 | 95 |


### **Exercise #1:** How many rows are in the dataset?

### **Exercise #2:** How many columns are in the dataset?

### **Exercise #3:** What are the column names in the dataset?

### **Exercise #4:** Print the entire second row.

### **Exercise #5:** Print the `Math Score` column.

### **Exercise #6:** Print the first three rows.

### **Exercise #7:** What is the dimension of the dataframe?

### **Exercise #8:** Print basic statistical data for the dataset (mean, standard deviation, etc.).

### **Exercise #9:** What is the mean for `Science Score`?

In [None]:
"""

" Answer here "


""";

---
© 2023 The Coding School, All rights reserved