<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/TRAIN_AWS_P1_Lecture_2_%5BSOLUTIONS%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lecture #2: Exploratory Data Analysis (EDA)**
---
**Description:**  Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves exploring and understanding the data to gain insights, identify patterns, and detect anomalies. EDA allows us to understand the underlying structure of the data, test assumptions, and prepare the data for further analysis. This lecture notebook will provide practice utilizing basic Pandas commands and exploring rows and columns as part of the EDA process.
<br>

**Lecture Notebook Structure:**

**Part 1**: [Reviewing EDA](#p1)

**Part 2**: [Working with rows and columns on the Linnerud dataset](#p2)

**Part 3**: [Exploring the mean, median and sum](#p3)

**Part 4**: [[OPTIONAL] EDA Review](#p4)



<br>


**Goals:** By the end of this lecture, you will:
* Be able to use basic Pandas commands.
* Be able to explore basic information about datasets.
* Know how to explore columns, rows, values, outliers, and more.
* Be able to remove missing or incorrect values.
* Practice data cleaning on real-world datasets.

<br>

### **Cheat Sheets**
[EDA cheatsheet](https://drive.google.com/file/d/1ZZnIzgcT8dYcGwWVAR9DDFIwGXTGbIiU/view?usp=sharing)

[Pandas cheatsheet](https://docs.google.com/document/d/1v-MZCgoZJGRcK-69OOu5fYhm58x2G0JUWyi2H53j8Ls/edit)

<a name="p1"></a>

## **Part 1: Reviewing EDA**
---
This dataset contains information on five students and their test scores in three subjects: math, English, and science. 

**Run the cell to import the dataset and create the dataframe.**

In [None]:
# pandas must be imported before pd.DataFrame can be used
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
        'Math Score': [75, 80, 90, 70, 85],
        'English Score': [80, 70, 85, 90, 75],
        'Science Score': [90, 85, 80, 75, 95]}

student_df = pd.DataFrame(data)
student_df

### **Exercise #1:** What are the column names in the dataset?

#### **Solution**

In [None]:
print("Column names:", student_df.columns)

Column names: Index(['Name', 'Math Score', 'English Score', 'Science Score'], dtype='object')


### **Exercise #2:** Print the first three rows.

#### **Solution**

In [None]:
student_df.head(3)

      Name  Math Score  English Score  Science Score
0    Alice          75             80             90
1      Bob          80             70             85
2  Charlie          90             85             80


### **Exercise #3:** What is the dimension of the dataframe?

#### **Solution**

In [None]:
print(student_df.shape)

(5, 4)
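Because `shape` is a plain tuple of `(rows, columns)`, it can be unpacked into two variables. The following is a minimal sketch that rebuilds the same student data so it runs on its own; the names `n_rows` and `n_cols` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same student data as above, rebuilt so this cell is self-contained
student_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Math Score': [75, 80, 90, 70, 85],
    'English Score': [80, 70, 85, 90, 75],
    'Science Score': [90, 85, 80, 75, 95],
})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = student_df.shape
print("rows:", n_rows, "columns:", n_cols)
```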


### **Exercise #4:** Print basic statistical data for the dataset (mean, standard deviation, etc).

#### **Solution**

In [None]:
# Print basic statistical data for the dataset
print(student_df.describe())

       Math Score  English Score  Science Score
count    5.000000       5.000000       5.000000
mean    80.000000      80.000000      85.000000
std      7.905694       7.905694       7.905694
min     70.000000      70.000000      75.000000
25%     75.000000      75.000000      80.000000
50%     80.000000      80.000000      85.000000
75%     85.000000      85.000000      90.000000
max     90.000000      90.000000      95.000000
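The `std` row reported by `describe()` is the sample standard deviation, which divides by `n - 1` rather than `n`. As a small sketch of the difference, the cell below recomputes it for the Math scores; the variable names are illustrative.

```python
import pandas as pd

# Math scores copied from the student data above
math_scores = pd.Series([75, 80, 90, 70, 85], name='Math Score')

# Series.std() defaults to ddof=1, i.e. it divides by n - 1 (sample std)
sample_std = math_scores.std()

# Passing ddof=0 divides by n instead (population std)
population_std = math_scores.std(ddof=0)

print("sample std:", round(sample_std, 6))
print("population std:", round(population_std, 6))
```

The sample value matches the 7.905694 shown in the table, while the population value is smaller.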


### **Exercise #5:** What is the mean for Science Score?

In [None]:
"""

" Answer here "


"""

#### **Solution**

In [None]:
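# The mean is 85: read it from the describe() output above, or compute it directly
print(student_df['Science Score'].mean())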

<a name="p2"></a>
## **Part 2: Working with rows and columns on the Linnerud dataset**

---

The Linnerud dataset is a small but useful dataset that consists of data on three different exercises performed by 20 middle-aged men at a fitness center. The three exercise variables included in the dataset are `Chins`, `Situps`, and `Jumps`.

- `Chins`: number of chin-ups performed by each participant
- `Situps`: number of sit-ups performed by each participant
- `Jumps`: number of jumping jacks performed by each participant

By analyzing the data, we can see how participants performed on each exercise and how much those results vary across the group.

<br>

**Run the following code before answering questions.**

In [None]:
# Import pandas
import pandas as pd

# Import datasets submodule
from sklearn import datasets

# Load dataset (actual data with associated documentation)
linnerud = datasets.load_linnerud()

# Create dataframe
linnerud_df = pd.DataFrame(data=linnerud.data, columns=linnerud.feature_names)
linnerud_df

### **Exercise #1:** What is the 16th value for `Chins`?

There are numerous ways to select and display a specific value. Remember, **Python uses 0-based indexing**, so the 16th row has an index of 15. You can either access only the 16th row to retrieve the 16th value, *or* you can select the 16th row and 1st column.
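The selection styles described above can be sketched on a toy dataframe. The data and names here (`toy_df`, columns `A` and `B`) are hypothetical and serve only to illustrate the indexing; `.loc` is shown as a third, label-based option.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate positional vs. label access
toy_df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1.5, 2.5, 3.5, 4.5]})

# The 3rd row sits at position 2 because counting starts at 0
by_column_then_label = toy_df['A'][2]  # select the column, then index label 2
by_position = toy_df.iloc[2, 0]        # row position 2, column position 0
by_label = toy_df.loc[2, 'A']          # row label 2, column name 'A'

print(by_column_then_label, by_position, by_label)
```

All three agree here only because the dataframe uses the default index (0, 1, 2, ...), where row labels and row positions coincide.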

In [None]:
# Print the 16th value for Chins


#### **Solution**

In [None]:
# first way
print("way 1: ", linnerud_df["Chins"][15])
# second way
print("way 2: ", linnerud_df.iloc[15, 0])

### **Exercise #2:** What is the 9th value for `Situps`?

In [None]:
# Print the 9th value for Situps


#### **Solution**

In [None]:
# first way
print("way 1: ", linnerud_df["Situps"][8])
# second way
print("way 2: ", linnerud_df.iloc[8, 1])

### **Exercise #3:** Print the entire fourth row.

In [None]:
# Print the entire fourth row


#### **Solution**

In [None]:
linnerud_df.iloc[3]

### **Exercise #4:** Print the `Jumps` column.

In [None]:
# Print the Jumps column


#### **Solution**

In [None]:
linnerud_df["Jumps"]

### **Exercise #5:** Print the 12th-16th rows.

In [None]:
# Print the 12th-16th rows


#### **Solution**

In [None]:
linnerud_df.iloc[11:16]
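The slice `11:16` stops at 16 because `iloc` excludes the stop position, so it covers positions 11 through 15, which are the 12th through 16th rows. As a sketch on hypothetical 20-row data (the name `toy_df` and its values are made up), compare this with label-based `.loc`, which includes the stop label:

```python
import pandas as pd

# Hypothetical toy data with 20 rows, labelled 0..19 by default
toy_df = pd.DataFrame({'value': range(100, 120)})

# iloc is position-based and excludes the stop: positions 11, 12, 13, 14, 15
by_position = toy_df.iloc[11:16]

# loc is label-based and includes the stop: labels 11 through 15
by_label = toy_df.loc[11:15]

print(len(by_position), len(by_label))
print(by_position.equals(by_label))
```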

<a name="p3"></a>
## **Part 3: Exploring the mean, median and sum**
---
We will continue with the Linnerud dataset, this time exploring the mean, median, and sum of each variable.

These statistics can provide insights into the overall distribution of the data, which can be useful in understanding the characteristics of the dataset that help researchers make data-driven decisions.

### **Exercise #1:** Find the mean for `Jumps`.

The mean and median can give an idea of the typical or central value of a variable, while the sum can indicate the total amount of a variable across all observations. These statistics can be used to summarize the data for a report or visualize the data in a chart, as well as for inferential purposes (i.e. testing hypotheses or making predictions). 
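To make the difference between these statistics concrete, here is a small sketch on hypothetical repetition counts (the values and the name `reps` are made up for illustration). One deliberately extreme value shows how the mean and median can diverge.

```python
import pandas as pd

# Hypothetical repetition counts; the last value is a deliberate outlier
reps = pd.Series([10, 12, 11, 13, 100])

reps_mean = reps.mean()      # pulled upward by the outlier
reps_median = reps.median()  # middle value after sorting, barely affected
reps_sum = reps.sum()        # total across all observations

print("mean:", reps_mean)
print("median:", reps_median)
print("sum:", reps_sum)
```

When the mean sits far from the median like this, it is often worth checking the column for outliers before summarizing it with a single number.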

In [None]:
# Mean for Jumps


#### **Solution**

In [None]:
print("mean: ", linnerud_df["Jumps"].mean())


### **Exercise #2:** Find the median for `Jumps`.


#### **Solution**

In [None]:
print("median: ", linnerud_df["Jumps"].median())


### **Exercise #3:** Find the sum for `Jumps`.


#### **Solution**

In [None]:
print("sum: ", linnerud_df["Jumps"].sum())

---
#### **Try Exercises #4-6 on your own!**
---

### **Exercise #4:** Find the mean for `Situps`.

In [None]:
# Mean for Situps


#### **Solution**

In [None]:
print("mean: ", linnerud_df["Situps"].mean())

### **Exercise #5:** Find the median for `Situps`.

#### **Solution**

In [None]:
print("median: ", linnerud_df["Situps"].median())

### **Exercise #6:** Find the sum for `Situps`.

#### **Solution**

In [None]:
print("sum: ", linnerud_df["Situps"].sum())
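The three statistics can also be computed in a single call with `DataFrame.agg`. The sketch below uses a small hypothetical dataframe (the values are made up, and only the column names mirror the Linnerud data) so that it runs without loading the dataset.

```python
import pandas as pd

# Hypothetical values; only the column names mirror the Linnerud dataframe
toy_df = pd.DataFrame({'Situps': [100, 200, 150], 'Jumps': [40, 60, 50]})

# One call returns a table with a row per statistic and a column per variable
summary = toy_df.agg(['mean', 'median', 'sum'])
print(summary)

# Individual values can then be looked up by statistic and column name
situps_sum = summary.loc['sum', 'Situps']
print("Situps sum:", situps_sum)
```

The same call should work on `linnerud_df` unchanged, since all of its columns are numeric.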

<a name="p4"></a>
## **Part 4: EDA Review**
---
This dataset was created manually using information from IMDb, Rotten Tomatoes, and Metacritic. The columns in the dataset include:

- Movie: the title of the movie
- IMDb: the rating of the movie on IMDb, on a scale from 0 to 10
- Rotten Tomatoes: the rating of the movie on Rotten Tomatoes, on a scale from 0 to 100
- Metacritic: the rating of the movie on Metacritic, on a scale from 0 to 100


**Run the code cell below to import the dataframe.**

In [None]:
import pandas as pd

data = {'Movie': ['The Godfather', 'The Shawshank Redemption', 'The Dark Knight', 'Pulp Fiction', 'The Silence of the Lambs'],
        'IMDb': [9.2, 9.3, 9.0, 8.9, 8.6],
        'Rotten Tomatoes': [97, 91, 94, 94, 96],
        'Metacritic': [100, 80, 84, 94, 85]}

movies_df = pd.DataFrame(data)
movies_df


### **Exercise #1:** What are the column names in the dataset?


#### **Solution**

In [None]:
print("Column names:", movies_df.columns)


Column names: Index(['Movie', 'IMDb', 'Rotten Tomatoes', 'Metacritic'], dtype='object')


### **Exercise #2:** Print the entire third row.

#### **Solution**

In [None]:
movies_df.iloc[2]


Movie              The Dark Knight
IMDb                           9.0
Rotten Tomatoes                 94
Metacritic                      84
Name: 2, dtype: object

### **Exercise #3:** Print the IMDb column.

#### **Solution**

In [None]:
movies_df['IMDb']


0    9.2
1    9.3
2    9.0
3    8.9
4    8.6
Name: IMDb, dtype: float64

### **Exercise #4:** Print the first five rows.

#### **Solution**

In [None]:
movies_df.head(5)


                      Movie  IMDb  Rotten Tomatoes  Metacritic
0             The Godfather   9.2               97         100
1  The Shawshank Redemption   9.3               91          80
2           The Dark Knight   9.0               94          84
3              Pulp Fiction   8.9               94          94
4  The Silence of the Lambs   8.6               96          85


### **Exercise #5:** What is the dimension of the dataframe?

#### **Solution**

In [None]:
print(movies_df.shape)


(5, 4)


### **Exercise #6:** What is the mean for the Metacritic column?

#### **Solution**

In [None]:
print("Metacritic Mean:", movies_df['Metacritic'].mean())


Metacritic Mean: 88.6


### **Exercise #7:** Print the 2nd-4th rows.

#### **Solution**

In [None]:
# movies_df has only five rows, so the requested rows must fall within positions 0-4
movies_df.iloc[1:4]

### **Exercise #8:** Print the entire 5th row.

#### **Solution**

In [None]:
# The 5th row is at position 4, the last valid position in this five-row dataframe
movies_df.iloc[4]
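Note that `iloc` treats out-of-range positions differently for slices and for single integers. The following sketch uses a hypothetical five-row dataframe (`small_df`, the same size as `movies_df`) to show both behaviors.

```python
import pandas as pd

# Hypothetical five-row frame, the same size as movies_df
small_df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})

# A slice that starts past the last row quietly returns an empty DataFrame
empty_slice = small_df.iloc[9:20]
print("rows in slice:", len(empty_slice))

# A single out-of-range position raises IndexError instead
try:
    small_df.iloc[6]
    raised = False
except IndexError:
    raised = True
print("IndexError raised:", raised)
```

Checking `shape` first is a simple way to confirm which row positions are valid.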


---
© 2023 The Coding School, All rights reserved