<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/Copy_of_TRAIN_AWS_P1_Lecture_2_%5BSTUDENTS%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lecture #2: Exploratory Data Analysis (EDA)**
---
### **Description:**  Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves exploring and understanding the data to gain insights, identify patterns, and detect anomalies. EDA allows us to understand the underlying structure of the data, test assumptions, and prepare the data for further analysis. This lecture notebook will provide practice utilizing basic Pandas commands and exploring rows and columns as part of the EDA process.
<br>


### **Lecture Notebook Structure**

**Part 1**: [Working with rows and columns on the Linnerud dataset](#p1)

**Part 2**: [Exploring the mean, median and sum](#p2)

**Part 3**: [Reviewing EDA](#p3)



<br>


**Goals:** By the end of this lecture, you will:
* Be able to use basic Pandas commands.
* Be able to explore basic information about datasets.
* Know how to explore columns, rows, values, outliers, and more.
* Be able to remove missing or incorrect values.
* Practice data cleaning on real-world datasets.

<br>

### **Cheat Sheets**
[EDA cheatsheet](https://drive.google.com/file/d/1ZZnIzgcT8dYcGwWVAR9DDFIwGXTGbIiU/view?usp=sharing)

[Pandas cheatsheet](https://docs.google.com/document/d/1v-MZCgoZJGRcK-69OOu5fYhm58x2G0JUWyi2H53j8Ls/edit)

<a name="p1"></a>
## **Part 1: Working with rows and columns on the Linnerud dataset**

---

The Linnerud dataset is a small but useful dataset that records three exercises performed by 20 middle-aged men at a fitness center. The three exercise variables included in the dataset are `Chins`, `Situps`, and `Jumps`.

- `Chins`: number of chin-ups performed by each participant
- `Situps`: number of sit-ups performed by each participant
- `Jumps`: number of jumping jacks performed by each participant

The dataset also includes three physiological measurements (`Weight`, `Waist`, and `Pulse`) for the same men, so it can be used to explore how exercise performance relates to physical condition. In this part we work only with the exercise variables.

<br>

**Run the following code before answering questions.**

In [None]:
# Import pandas
import pandas as pd

# Import the datasets submodule from scikit-learn
from sklearn import datasets

# Load the dataset (the data together with its documentation)
linnerud = datasets.load_linnerud()
# load_linnerud() returns measurements for 20 middle-aged men in two parts:
# `data` holds the exercise variables (Chins, Situps, Jumps) and
# `target` holds the physiological variables (Weight, Waist, Pulse).

# Create a dataframe from the exercise variables
linnerud_df = pd.DataFrame(data=linnerud.data, columns=linnerud.feature_names)
linnerud_df

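|   | Chins | Situps | Jumps |
|---|-------|--------|-------|
| 0 | 5.0 | 162.0 | 60.0 |
| 1 | 2.0 | 110.0 | 60.0 |
| 2 | 12.0 | 101.0 | 101.0 |
| 3 | 12.0 | 105.0 | 37.0 |
| 4 | 13.0 | 155.0 | 58.0 |
| 5 | 4.0 | 101.0 | 42.0 |
| 6 | 8.0 | 101.0 | 38.0 |
| 7 | 6.0 | 125.0 | 40.0 |
| 8 | 15.0 | 200.0 | 40.0 |
| 9 | 17.0 | 251.0 | 250.0 |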


### **Exercise #1:** What is the 16th value for `Chins`?

There are several ways to select and display a specific value. Remember, **Python uses 0-based indexing**, so the 16th row has an index of 15. You can either select the `Chins` column and then take the value at index 15, *or* select row 15 and column 0 (the 1st column) in a single step.

In [None]:
# Print the 16th value for Chins
chins_value = linnerud_df['Chins'][15]

print("16th value for Chins: ", chins_value)

16th value for Chins:  12.0
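
The selection options described above can be sketched as follows. This is a minimal illustration on a small stand-in dataframe (`demo_df` and the `value_from_*` names are assumptions made for the example, not part of the lecture code); `.loc` selects by label and `.iloc` selects by integer position.

```python
import pandas as pd

# Stand-in data, used only to illustrate the selection syntax
demo_df = pd.DataFrame({
    "Chins": [5.0, 2.0, 12.0, 12.0],
    "Situps": [162.0, 110.0, 101.0, 105.0],
})

# Option 1: pick the column first, then the row label
value_from_column = demo_df["Chins"][2]

# Option 2: pick the row label and the column name in one step
value_from_loc = demo_df.loc[2, "Chins"]

# Option 3: pick the row position and the column position (both 0-based)
value_from_iloc = demo_df.iloc[2, 0]

print(value_from_column, value_from_loc, value_from_iloc)
```

All three expressions return the same value here because the dataframe uses the default integer index, so row labels and row positions coincide.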


### **Exercise #2:** What is the 9th value for `Situps`?

In [None]:
# Print the 9th value for Situps
linnerud_df['Situps'][8]

200.0

### **Exercise #3:** Print the entire fourth row.

In [None]:
# Print the entire fourth row
linnerud_df.loc[3]

Chins      12.0
Situps    105.0
Jumps      37.0
Name: 3, dtype: float64

### **Exercise #4:** Print the `Jumps` column.

In [None]:
# Print the Jumps column
jumps_values = linnerud_df['Jumps']   # select the "Jumps" column of the dataframe
print("The Jumps column values are: ", jumps_values)   # print the values in the "Jumps" column

The Jumps column values are:  0      60.0
1      60.0
2     101.0
3      37.0
4      58.0
5      42.0
6      38.0
7      40.0
8      40.0
9     250.0
10     38.0
11    115.0
12    105.0
13     50.0
14     31.0
15    120.0
16     25.0
17     80.0
18     73.0
19     43.0
Name: Jumps, dtype: float64


### **Exercise #5:** Print the 12th-16th rows.

In [None]:
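# Print the 12th-16th rows (positions 11-15, because indexing starts at 0)
rows_12_16 = linnerud_df.iloc[11:16, :]   # iloc: the end position (16) is excluded
print(rows_12_16)
rows_12_16 = linnerud_df.loc[11:15, :]   # loc: the end label (15) is included
print(rows_12_16)

    Chins  Situps  Jumps
11   13.0   210.0  115.0
12   14.0   215.0  105.0
13    1.0    50.0   50.0
14    6.0    70.0   31.0
15   12.0   210.0  120.0
    Chins  Situps  Jumps
11   13.0   210.0  115.0
12   14.0   215.0  105.0
13    1.0    50.0   50.0
14    6.0    70.0   31.0
15   12.0   210.0  120.0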


<a name="p2"></a>
## **Part 2: Exploring the mean, median and sum**
---
We will continue with the Linnerud dataset, this time exploring the mean, median, and sum of each variable.

These statistics summarize the overall distribution of the data, which helps researchers understand the dataset's characteristics and make data-driven decisions.

### **Exercise #1:** Find the mean, median, and sum for `Jumps`.

The mean and median give an idea of the typical or central value of a variable, while the sum indicates the total amount of a variable across all observations. These statistics can be used to summarize the data for a report or visualize the data in a chart, as well as for inferential purposes (e.g., testing hypotheses or making predictions).

A note on the slicing used in Part 1, Exercise #5: in pandas, `iloc` selects by 0-based integer position and excludes the end position, just like ordinary Python slicing, while `loc` selects by index label and includes both the start and end labels.

So `linnerud_df.iloc[11:16, :]` selects the rows at positions 11, 12, 13, 14, and 15 (the 12th through 16th rows), and `linnerud_df.loc[11:15, :]` selects the rows labeled 11, 12, 13, 14, and 15.

For this dataset the two statements return the same result, because the dataframe uses the default integer index (0, 1, 2, ...), so every row's label equals its position.

Choose the indexing method that matches what you mean. If the labels no longer match the positions, for example after filtering or sorting rows or after setting a custom index, the same slice can return different rows with `loc` and `iloc`, or raise an error.
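
To see the difference between the two indexers, here is a minimal sketch on a made-up dataframe whose labels start at 1 instead of 0. The `scores` dataframe and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical scores; the index labels (1-5) deliberately differ from the positions (0-4)
scores = pd.DataFrame({"score": [10, 20, 30, 40, 50]}, index=[1, 2, 3, 4, 5])

# Position-based: rows at positions 1 and 2 (the end position 3 is excluded)
by_position = scores.iloc[1:3]

# Label-based: rows labeled 1, 2, and 3 (the end label 3 is included)
by_label = scores.loc[1:3]

print(by_position["score"].tolist())
print(by_label["score"].tolist())
```

The same slice `1:3` returns two rows with `iloc` and three rows with `loc`, which is why it matters whether you are thinking in positions or in labels.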

In [None]:
# Mean, median and sum for Jumps
jumps_mean = linnerud_df['Jumps'].mean()   # calculate the mean for the "Jumps" column
jumps_median = linnerud_df['Jumps'].median()   # calculate the median for the "Jumps" column
jumps_sum = linnerud_df['Jumps'].sum()   # calculate the sum for the "Jumps" column

print("Mean Jumps:", jumps_mean)
print("Median Jumps:", jumps_median)
print("Sum Jumps:", jumps_sum)

Mean Jumps: 70.3
Median Jumps: 54.0
Sum Jumps: 1406.0
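
The gap between the mean (70.3) and the median (54.0) suggests that a few large values pull the mean upward. The sketch below checks this by dropping the single largest `Jumps` value; the values are copied from the column printed in Part 1, and treating that value as an outlier is an assumption made for illustration, not a conclusion from the lecture.

```python
import pandas as pd

# Jumps values copied from the column printed in Part 1, Exercise #4
jumps = pd.Series([60, 60, 101, 37, 58, 42, 38, 40, 40, 250,
                   38, 115, 105, 50, 31, 120, 25, 80, 73, 43], dtype=float)

# Drop the single largest value to see how much it influences each statistic
jumps_without_max = jumps.drop(jumps.idxmax())

print("Mean with / without the largest value:",
      jumps.mean(), jumps_without_max.mean())
print("Median with / without the largest value:",
      jumps.median(), jumps_without_max.median())
```

Removing one value lowers the mean from 70.3 to about 60.8, while the median only moves from 54.0 to 50.0. This is why the median is often preferred as a measure of the typical value when the data contain extreme values.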


---
#### **Try Exercise #2 on your own!**
---

### **Exercise #2:** Find the mean, median, and sum for `Situps`.

In [None]:
# Mean, median and sum for Situps
situps_mean = linnerud_df['Situps'].mean()   # calculate the mean for the "Situps" column
situps_median = linnerud_df['Situps'].median()   # calculate the median for the "Situps" column
situps_sum = linnerud_df['Situps'].sum()   # calculate the sum for the "Situps" column

print("Mean Situps:", situps_mean)
print("Median Situps:", situps_median)
print("Sum Situps:", situps_sum)

Mean Situps: 145.55
Median Situps: 122.5
Sum Situps: 2911.0


<a name="p3"></a>

## **Part 3: Reviewing EDA**
---
This dataset contains information on five students and their test scores in three subjects: math, English, and science. 

**Run the cell below to create the dataset and the dataframe.**

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Math Score': [75, 80, 90, 70, 85],
    'English Score': [80, 70, 85, 90, 75],
    'Science Score': [90, 85, 80, 75, 95],
}

student_df = pd.DataFrame(data)
student_df

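|   | Name | Math Score | English Score | Science Score |
|---|------|------------|---------------|---------------|
| 0 | Alice | 75 | 80 | 90 |
| 1 | Bob | 80 | 70 | 85 |
| 2 | Charlie | 90 | 85 | 80 |
| 3 | David | 70 | 90 | 75 |
| 4 | Emma | 85 | 75 | 95 |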


### **Exercise #1:** How many rows are in the dataset?

In [None]:
num_rows = student_df.shape[0]
print("Number of rows in the dataset: ", num_rows)

Number of rows in the dataset:  5


### **Exercise #2:** How many columns are in the dataset?

In [None]:
num_cols = student_df.shape[1]
print("Number of columns in the dataset: ", num_cols)

Number of columns in the dataset:  4


### **Exercise #3:** What are the column names in the dataset?

In [None]:
print("Column names in the dataset: ", student_df.columns)

Column names in the dataset:  Index(['Name', 'Math Score', 'English Score', 'Science Score'], dtype='object')


### **Exercise #4:** Print the entire second row.

In [None]:
second_row = student_df.iloc[1, :]   # extract the second row of the dataframe
print("The second row is:\n", second_row)   # print the second row

The second row is:
 Name             Bob
Math Score        80
English Score     70
Science Score     85
Name: 1, dtype: object


`\n` is an escape sequence in Python (and many other programming languages) that represents a newline character. Placing it in a string makes the text that follows start on a new line.
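
A side effect is visible in the output above: the first line after the newline starts with one extra space. This happens because `print` joins its arguments with a separator, which defaults to a single space. The sketch below shows one way to avoid it; `label`, `row_text`, `default_output`, and `aligned_output` are illustrative names, and `row_text` is a stand-in string rather than a real dataframe row.

```python
label = "The second row is:\n"
row_text = "Name    Bob"  # hypothetical stand-in for the printed row

# print() joins its arguments with sep, which defaults to a single space
default_output = " ".join([label, row_text])
print(label, row_text)            # writes default_output

# With sep="" nothing is inserted, so the row starts at the left margin
aligned_output = "".join([label, row_text])
print(label, row_text, sep="")    # writes aligned_output
```

Building the whole message as one string, for example with an f-string, gives the same aligned result.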

### **Exercise #5:** Print the Math Score column.

In [None]:
math_scores = student_df['Math Score']   # extract the "Math Score" column of the dataframe
print("The Math Score column is:\n", math_scores)   # print the "Math Score" column

The Math Score column is:
 0    75
1    80
2    90
3    70
4    85
Name: Math Score, dtype: int64


### **Exercise #6:** Print the first three rows.

In [None]:
first_three_rows = student_df.head(3)   # extract the first three rows of the dataframe
print("The first three rows are:\n", first_three_rows)   # print the first three rows

The first three rows are:
       Name  Math Score  English Score  Science Score
0    Alice          75             80             90
1      Bob          80             70             85
2  Charlie          90             85             80


### **Exercise #7:** What are the dimensions of the dataframe?

In [None]:
num_rows, num_cols = student_df.shape   # get the number of rows and columns of the dataframe
print("The dataframe has", num_rows, "rows and", num_cols, "columns.")   # print the dimensions of the dataframe

The dataframe has 5 rows and 4 columns.


### **Exercise #8:** Print basic statistical data for the dataset (mean, standard deviation, etc).

In [None]:
stats = student_df.describe()   # compute basic statistics for the dataframe
print("Basic statistical data for the dataset:\n", stats)   # print the statistics

Basic statistical data for the dataset:
        Math Score  English Score  Science Score
count    5.000000       5.000000       5.000000
mean    80.000000      80.000000      85.000000
std      7.905694       7.905694       7.905694
min     70.000000      70.000000      75.000000
25%     75.000000      75.000000      80.000000
50%     80.000000      80.000000      85.000000
75%     85.000000      85.000000      90.000000
max     90.000000      90.000000      95.000000


### **Exercise #9:** What is the mean for Science Score?

In [None]:
"""

" Answer here "


""";

In [None]:
science_score_mean = student_df['Science Score'].mean()   # compute the mean for the "Science Score" column
print("The mean for the Science Score column is:", science_score_mean)

The mean for the Science Score column is: 85.0
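
The `std` value that `describe()` reports for each score column (7.905694) builds on the mean: it is the sample standard deviation, which divides the summed squared deviations from the mean by n - 1 rather than n. A minimal sketch that reproduces it by hand for the Math Score column (the scores are copied from the dataframe above; `manual_std` is an illustrative name):

```python
import pandas as pd

# Math scores copied from the student dataframe defined in Part 3
math_scores = pd.Series([75, 80, 90, 70, 85])

mean = math_scores.mean()

# Sum of squared deviations from the mean
squared_deviations = ((math_scores - mean) ** 2).sum()

# pandas divides by n - 1 (sample standard deviation) by default
manual_std = (squared_deviations / (len(math_scores) - 1)) ** 0.5

print("Manual std:", manual_std)
print("pandas std:", math_scores.std())
```

All three score columns have the same set of deviations from their means (-10, -5, 0, 5, 10), which is why `describe()` reports the same `std` for each of them.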


---
© 2023 The Coding School, All rights reserved