<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/TRAIN_AWS_P1_Lab_2_%5BSOLUTIONS%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab #2: Exploratory Data Analysis (EDA)**
---
**Description:** Exploratory Data Analysis (EDA) is a crucial step in the data analysis process: exploring the data to gain insights, identify patterns, and detect anomalies. EDA helps us understand the underlying structure of the data, test assumptions, and prepare the data for further analysis. This lab provides practice using basic Pandas commands to explore datasets: inspecting rows and columns and calculating summary values such as the mean, median, and sum.
<br>


**Lab Structure:**

**Part 1**: [Review: Basic Pandas Commands](#p1)


> **Part 1.1**: [Review: Exploring rows and columns](#p1.1)
>
> **Part 1.2**: [[Additional Practice] Exploratory Data Analysis (EDA) with Gapminder Data](#p4)

**Part 2**: [[Optional] Data Wrangling Practice](#p2)

**Part 3**: [[Optional] Data Cleaning Practice](#p3)


<br>


**Goals:** By the end of this lab, you will:
* Be able to use basic Pandas commands.
* Be able to explore basic information about datasets.
* Know how to explore columns, rows, values, outliers, and more.
* Be able to remove missing or incorrect values.
* Practice data cleaning on real-world datasets.

<br>

### **Cheat Sheets**
[EDA cheatsheet](https://drive.google.com/file/d/1ZZnIzgcT8dYcGwWVAR9DDFIwGXTGbIiU/view?usp=sharing)

[Pandas cheatsheet](https://docs.google.com/document/d/1v-MZCgoZJGRcK-69OOu5fYhm58x2G0JUWyi2H53j8Ls/edit)

<a name="p1"></a>
## **Part 1: Reviewing Basic Pandas Commands**

---
Let's practice a few of the Pandas commands we learned in the previous lab. 

**About the dataset:** This small example dataset contains information about five individuals: name, age, gender, country of residence, and salary.

**Run the code below to create the sample data and load it into a DataFrame.**

In [None]:
# Import pandas
import pandas as pd

# Create a dictionary of sample data
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'Age': [25, 30, 35, 40, 45],
        'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
        'Country': ['USA', 'Canada', 'UK', 'Australia', 'USA'],
        'Salary': [50000, 60000, 70000, 80000, 90000]}

# Create a Pandas dataframe from the dictionary
df = pd.DataFrame(data)

### **Exercise #1:** Print the first 5 rows of the dataframe.


In [None]:
# Print the first 5 rows of the dataframe


#### **Solution**

In [None]:
# Print the first 5 rows of the dataframe
print(df.head())
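
As a small aside, `head()` accepts the number of rows to return, and `tail()` does the same from the end of the dataframe. The sketch below rebuilds a two-column version of the sample data under the hypothetical name `people` so it can run on its own.

```python
import pandas as pd

# Two columns of the sample data, rebuilt so this sketch is self-contained
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 35, 40, 45],
})

first_two = people.head(2)  # first 2 rows
last_two = people.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```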

### **Exercise #2:** Print the column headers and data types of the dataframe.

In [None]:
# Print the column headers and data types of the dataframe


#### **Solution**

In [None]:
# Print the column headers and data types of the dataframe
df.info()  # info() prints its summary directly and returns None, so print() is not needed
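
If only the data types are needed, the `dtypes` attribute is a lighter alternative to `info()`. The following is a minimal sketch on a rebuilt copy of the sample data (the name `people` is an assumption made for illustration); `select_dtypes` is used to pick out the numeric columns.

```python
import pandas as pd

# Part of the sample data, rebuilt so this sketch is self-contained
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# dtypes is a Series that maps each column name to its data type
column_types = people.dtypes
print(column_types)

# Keep only the names of the numeric columns
numeric_columns = people.select_dtypes(include='number').columns.tolist()
print(numeric_columns)
```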

### **Exercise #3:** Print the column headings.

In [None]:
# Print the column headings

####  **Solution**

In [None]:
# Print the column headings
print(df.columns)

### **Exercise #4:** What is the shape (number of rows and columns) of the dataframe?


In [None]:
# Print the shape of the dataframe


#### **Solution**

In [None]:
# Print the shape of the dataframe as (rows, columns)
print(df.shape)

### **Exercise #5:** How many dimensions (axes) does the dataframe have?

In [None]:
# Print the number of dimensions of the dataframe


#### **Solution**

In [None]:
# Print the number of dimensions of the dataframe
print(df.ndim)
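
The difference between `shape` and `ndim` is easy to mix up, so here is a short sketch comparing them on a dataframe and on a single column. The data is a rebuilt subset of the sample data, and the variable names are chosen for illustration only.

```python
import pandas as pd

# Three columns of the sample data, rebuilt so this sketch is self-contained
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

# shape is a tuple of (rows, columns); ndim is the number of axes
n_rows, n_cols = people.shape
print(people.shape, people.ndim)

# A single column is a Series: one axis, so its shape has one entry
ages = people['Age']
print(ages.shape, ages.ndim)
```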

### **Exercise #6:** Print basic summary statistics for the dataset (mean, standard deviation, etc.).

#### **Solution**

In [None]:
print(df.describe())
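
The lab description also mentions the mean, median, and sum. As a sketch, the individual statistics can be computed one column at a time and compared with the `describe()` table; the data below repeats the numeric columns of the sample dataset under the assumed name `people`.

```python
import pandas as pd

# Numeric columns of the sample data, rebuilt so this sketch is self-contained
people = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 70000, 80000, 90000],
})

mean_salary = people['Salary'].mean()      # arithmetic average
median_salary = people['Salary'].median()  # middle value
total_salary = people['Salary'].sum()      # column total
print(mean_salary, median_salary, total_salary)

# describe() reports the same mean, and the median as the '50%' row
summary = people.describe()
print(summary.loc[['mean', '50%'], 'Salary'])
```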

### **Exercise #7:** How many times does each unique value appear in the `Gender` column?


In [None]:
# Print counts of unique values in the 'Gender' column

#### **Solution**

In [None]:
print(df['Gender'].value_counts())

### **Exercise #8:** Print all unique values in the `Country` column.


In [None]:
# Unique values in 'Country'

#### **Solution**

In [None]:
print(df['Country'].unique())
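
The last two exercises use closely related methods. The sketch below, on a rebuilt copy of the relevant columns (named `people` here for illustration), puts `value_counts()`, `unique()`, and `nunique()` side by side.

```python
import pandas as pd

# Two columns of the sample data, rebuilt so this sketch is self-contained
people = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
    'Country': ['USA', 'Canada', 'UK', 'Australia', 'USA'],
})

# value_counts(): how many rows hold each distinct value
gender_counts = people['Gender'].value_counts()
print(gender_counts)

# unique(): the distinct values themselves, in order of first appearance
countries = people['Country'].unique()
print(countries)

# nunique(): just the number of distinct values
n_countries = people['Country'].nunique()
print(n_countries)
```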

---

<center>

#### **Back to lecture**

</center>

---

<a name="p2"></a>
## **Part 2: [Optional] Data Wrangling Practice**
---

Use the DataFrame below on NBA basketball players to answer Exercises #1-2. Take a moment to explore the dataframe created in the cell below, which contains the names of famous NBA players along with their ages, heights, and teams.

**Remember to run the cell below to load the DataFrame before continuing onto the problems.**

In [None]:
# Import NumPy and pandas
import numpy as np
import pandas as pd

# create dataframe
df = pd.DataFrame(
  {'Name':['Giannis Antetokounmpo','Kevin Durant','Stephen Curry','Nikola Jokic', 'Joel Embiid'],
  'Age':[28, 34, 34, 27, np.nan],
  'Height (in)':[83, 82, 74, 83, np.nan],
  'Team':['Milwaukee Bucks', np.nan, 'Golden State Warriors', 'Denver Nuggets', np.nan] })
df

| | Name | Age | Height (in) | Team |
|---|---|---|---|---|
| 0 | Giannis Antetokounmpo | 28.0 | 83.0 | Milwaukee Bucks |
| 1 | Kevin Durant | 34.0 | 82.0 | NaN |
| 2 | Stephen Curry | 34.0 | 74.0 | Golden State Warriors |
| 3 | Nikola Jokic | 27.0 | 83.0 | Denver Nuggets |
| 4 | Joel Embiid | NaN | NaN | NaN |


### **Exercise #1:** Use `isnull()` to see which values are missing.

#### **Solution**

In [None]:
df.isnull()

| | Name | Age | Height (in) | Team |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | False | False | True |
| 2 | False | False | False | False |
| 3 | False | False | False | False |
| 4 | False | True | True | True |
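
A table of booleans becomes hard to read on larger datasets. One common follow-up, sketched here on a rebuilt copy of the NBA data (the name `players` is an assumption), is to count the missing values per column and to pull out the rows that contain at least one.

```python
import numpy as np
import pandas as pd

# NBA data from the cell above, rebuilt so this sketch is self-contained
players = pd.DataFrame({
    'Name': ['Giannis Antetokounmpo', 'Kevin Durant', 'Stephen Curry',
             'Nikola Jokic', 'Joel Embiid'],
    'Age': [28, 34, 34, 27, np.nan],
    'Height (in)': [83, 82, 74, 83, np.nan],
    'Team': ['Milwaukee Bucks', np.nan, 'Golden State Warriors',
             'Denver Nuggets', np.nan],
})

# True counts as 1 when summed, so this gives missing values per column
missing_per_column = players.isnull().sum()
print(missing_per_column)

# Rows where any column is missing
rows_with_missing = players[players.isnull().any(axis=1)]
print(rows_with_missing)
```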


---

<center>

#### **Back to lecture**

---

### **Exercise #2:** Since Joel Embiid's data is missing, drop the row. 

#### **Solution**

In [None]:
df.drop(index=[4])

| | Name | Age | Height (in) | Team |
|---|---|---|---|---|
| 0 | Giannis Antetokounmpo | 28.0 | 83.0 | Milwaukee Bucks |
| 1 | Kevin Durant | 34.0 | 82.0 | NaN |
| 2 | Stephen Curry | 34.0 | 74.0 | Golden State Warriors |
| 3 | Nikola Jokic | 27.0 | 83.0 | Denver Nuggets |
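
Note that `drop()` returns a new dataframe and leaves the original untouched, so the result has to be assigned to keep it. A minimal sketch, using a shortened copy of the NBA data under the assumed name `players`:

```python
import numpy as np
import pandas as pd

# Shortened NBA data, rebuilt so this sketch is self-contained
players = pd.DataFrame({
    'Name': ['Giannis Antetokounmpo', 'Kevin Durant', 'Stephen Curry',
             'Nikola Jokic', 'Joel Embiid'],
    'Age': [28, 34, 34, 27, np.nan],
})

# drop() does not modify players; assign the result to keep it
trimmed = players.drop(index=[4])
print(len(players), len(trimmed))

# Optionally renumber the remaining rows from 0
trimmed = trimmed.reset_index(drop=True)
print(trimmed)
```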


---

<center>

#### **Back to lecture**

---

<a name="p3"></a>

## **Part 3: Data Cleaning Practice [Optional]**
---
The dataframe below contains information about five countries in North and South America: the country name, capital city, population, official language, and GDP. It has five rows and five columns, with each row representing one country.

In [None]:
import pandas as pd
import numpy as np

data = {
    'Country': ['USA', 'Canada', 'Mexico', 'Brazil', 'Argentina'],
    'Capital': ['Washington, D.C.', np.nan, 'Mexico City', 'Brasília', 'Buenos Aires'],
    'Population (millions)': [328, np.nan, 130, 211, 45],
    'Official Language': ['English', np.nan, 'Spanish', 'Portuguese', 'Spanish'],
    'GDP (trillions USD)': [21.44, 1.84, np.nan, 2.05, 0.45]
}

countries_df = pd.DataFrame(data)
countries_df


### **Exercise #1:** Identify which values are missing.

#### **Solution**

In [None]:
countries_df.isnull()

| | Country | Capital | Population (millions) | Official Language | GDP (trillions USD) |
|---|---|---|---|---|---|
| 0 | False | False | False | False | False |
| 1 | False | True | True | True | False |
| 2 | False | False | False | False | True |
| 3 | False | False | False | False | False |
| 4 | False | False | False | False | False |


### **Exercise #2:** Drop any rows with missing values.

#### **Solution**

In [None]:
countries_df.dropna()

| | Country | Capital | Population (millions) | Official Language | GDP (trillions USD) |
|---|---|---|---|---|---|
| 0 | USA | Washington, D.C. | 328.0 | English | 21.44 |
| 3 | Brazil | Brasília | 211.0 | Portuguese | 2.05 |
| 4 | Argentina | Buenos Aires | 45.0 | Spanish | 0.45 |
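
Dropping every row with any missing value can discard a lot of data. As a sketch of some gentler options, `dropna()` also accepts `thresh` (the minimum number of non-missing values a row needs to be kept) and `axis=1` (drop columns instead of rows). The data is rebuilt under the assumed name `americas` so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Countries data from the cell above, rebuilt so this sketch is self-contained
americas = pd.DataFrame({
    'Country': ['USA', 'Canada', 'Mexico', 'Brazil', 'Argentina'],
    'Capital': ['Washington, D.C.', np.nan, 'Mexico City', 'Brasília', 'Buenos Aires'],
    'Population (millions)': [328, np.nan, 130, 211, 45],
    'Official Language': ['English', np.nan, 'Spanish', 'Portuguese', 'Spanish'],
    'GDP (trillions USD)': [21.44, 1.84, np.nan, 2.05, 0.45],
})

# Default: drop rows that have at least one missing value
complete_rows = americas.dropna()

# Keep rows that have at least 3 non-missing values
mostly_complete = americas.dropna(thresh=3)

# Drop columns (not rows) that have at least one missing value
complete_columns = americas.dropna(axis=1)

print(complete_rows.index.tolist())
print(mostly_complete.index.tolist())
print(complete_columns.columns.tolist())
```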


### **Exercise #3:** Drop rows with missing values in the `Capital` column.

#### **Solution**

In [None]:
countries_df.dropna(subset=['Capital'])

| | Country | Capital | Population (millions) | Official Language | GDP (trillions USD) |
|---|---|---|---|---|---|
| 0 | USA | Washington, D.C. | 328.0 | English | 21.44 |
| 2 | Mexico | Mexico City | 130.0 | Spanish | NaN |
| 3 | Brazil | Brasília | 211.0 | Portuguese | 2.05 |
| 4 | Argentina | Buenos Aires | 45.0 | Spanish | 0.45 |


### **Exercise #4:** Drop the `Population (millions)` column.

#### **Solution**

In [None]:
countries_df.drop(columns='Population (millions)')

| | Country | Capital | Official Language | GDP (trillions USD) |
|---|---|---|---|---|
| 0 | USA | Washington, D.C. | English | 21.44 |
| 1 | Canada | NaN | NaN | 1.84 |
| 2 | Mexico | Mexico City | Spanish | NaN |
| 3 | Brazil | Brasília | Portuguese | 2.05 |
| 4 | Argentina | Buenos Aires | Spanish | 0.45 |


### **Exercise #5:** Drop the second row.

#### **Solution**

In [None]:
countries_df.drop(index=1)

| | Country | Capital | Population (millions) | Official Language | GDP (trillions USD) |
|---|---|---|---|---|---|
| 0 | USA | Washington, D.C. | 328.0 | English | 21.44 |
| 2 | Mexico | Mexico City | 130.0 | Spanish | NaN |
| 3 | Brazil | Brasília | 211.0 | Portuguese | 2.05 |
| 4 | Argentina | Buenos Aires | 45.0 | Spanish | 0.45 |
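
`drop(index=1)` removes the row whose index *label* is 1. That matches the second row here only because the dataframe still has its default 0, 1, 2, ... index. The sketch below (with an assumed, shortened dataframe named `americas`) shows one way to drop by position when the labels no longer line up.

```python
import pandas as pd

# Shortened countries data, rebuilt so this sketch is self-contained
americas = pd.DataFrame({
    'Country': ['USA', 'Canada', 'Mexico', 'Brazil', 'Argentina'],
})

# After dropping the first row, the remaining labels are 1, 2, 3, 4
filtered = americas.drop(index=0)

# The second row by position now carries label 2, not 1
second_row_label = filtered.index[1]
without_second = filtered.drop(index=second_row_label)

print(second_row_label)
print(without_second['Country'].tolist())
```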


### **Exercise #6:**  Calculate the mean of the `GDP (trillions USD)` column and then fill missing values with the mean of the column.

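#### **Solution**

In [None]:
# mean() skips missing values by default
mean_gdp = countries_df['GDP (trillions USD)'].mean()
# Replace the missing GDP values with the column mean
countries_df['GDP (trillions USD)'] = countries_df['GDP (trillions USD)'].fillna(value=mean_gdp)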
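
Because one very large value (the USA) pulls the mean upward, the median is sometimes preferred as a fill value. The following sketch compares the two on the same GDP figures; it is an illustration under that assumption, not a recommendation for this lab, and the variable names are chosen for the example.

```python
import numpy as np
import pandas as pd

# GDP column from the dataframe above, rebuilt so this sketch is self-contained
gdp = pd.Series([21.44, 1.84, np.nan, 2.05, 0.45], name='GDP (trillions USD)')

# Both statistics ignore the missing value by default
mean_gdp = gdp.mean()
median_gdp = gdp.median()
print(mean_gdp, median_gdp)

# fillna() returns a new Series with the gaps filled
filled_with_mean = gdp.fillna(mean_gdp)
filled_with_median = gdp.fillna(median_gdp)
print(filled_with_mean[2], filled_with_median[2])
```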

### **Exercise #7:** Rename the `Official Language` column to `Language`.

#### **Solution**

In [None]:
countries_df = countries_df.rename(columns={'Official Language': 'Language'})
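
`rename()` changes column headers, not the values stored in a column. To change a value such as `Mexico City` to `CDMX`, one option is `Series.replace()`. A minimal sketch on a shortened, rebuilt dataframe (the name `americas` is an assumption):

```python
import pandas as pd

# Shortened countries data, rebuilt so this sketch is self-contained
americas = pd.DataFrame({
    'Country': ['USA', 'Mexico', 'Brazil'],
    'Capital': ['Washington, D.C.', 'Mexico City', 'Brasília'],
})

# replace() swaps matching values and leaves everything else unchanged
americas['Capital'] = americas['Capital'].replace('Mexico City', 'CDMX')

print(americas)
```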

### **Exercise #8:** What is the data type for each of the columns?

#### **Solution**

In [None]:
print(countries_df.dtypes)

Country                   object
Capital                   object
Population (millions)    float64
Language                  object
GDP (trillions USD)      float64
dtype: object


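### **Exercise #9:** Insert a new column called `Population Density` at column position 3.

#### **Solution**

In [None]:
# Placeholder values only: a real density needs land area, which this dataset does not include
countries_df.insert(3, 'Population Density', countries_df['Population (millions)'] / 3)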
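
Unlike `drop()` and `dropna()`, `insert()` modifies the dataframe in place and returns `None`. By default it also refuses to add a column name that already exists. The sketch below uses a tiny made-up dataframe (`demo`, with hypothetical columns `A`, `B`, `C`) to show both behaviors.

```python
import pandas as pd

# Tiny made-up dataframe used only to illustrate insert()
demo = pd.DataFrame({'A': [1, 2], 'C': [5, 6]})

# Position 1 places the new column between 'A' and 'C'; demo itself changes
result = demo.insert(1, 'B', [3, 4])
print(result)                # None, because insert() works in place
print(demo.columns.tolist())

# Inserting a name that already exists raises ValueError by default
try:
    demo.insert(0, 'B', [0, 0])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```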

---
© 2023 The Coding School, All rights reserved