<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/TRAIN_AWS_P1_Lab_2_%5BSTUDENTS%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab #2: Exploratory Data Analysis (EDA)**
---
### **Description:**  Exploratory Data Analysis (EDA) is a crucial step in the data analysis process: exploring the data to gain insights, identify patterns, and detect anomalies. EDA helps us understand the underlying structure of the data, test assumptions, and prepare the data for further analysis. In this lab, you will practice basic Pandas commands for exploring different datasets: working with rows and columns and calculating the mean, median, and sum.
<br>


### **Lab Structure**
**Part 1**: [Review: Basic Pandas Commands](#p1)

**Part 2**: [Data Wrangling Practice](#p2)

**Part 3**: [[OPTIONAL] Data Cleaning Practice](#p3)

**Part 4**: [[Additional Practice] Exploratory Data Analysis (EDA) with Gapminder Data](#p4)


<br>


**Goals:** By the end of this lab, you will:
* Be able to use basic Pandas commands.
* Be able to explore basic information about datasets.
* Know how to explore columns, rows, values, outliers, and more.
* Be able to remove missing or incorrect values.
* Practice data cleaning on real-world datasets.

<br>

### **Cheat Sheets**
[EDA cheatsheet](https://drive.google.com/file/d/1ZZnIzgcT8dYcGwWVAR9DDFIwGXTGbIiU/view?usp=sharing)

[Pandas cheatsheet](https://docs.google.com/document/d/1v-MZCgoZJGRcK-69OOu5fYhm58x2G0JUWyi2H53j8Ls/edit)

<a name="p1"></a>
## **Part 1: Review: Basic Pandas Commands**

---
Let's practice a few of the Pandas commands we learned in the previous lab. 
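As a quick refresher, the sketch below runs the inspection commands used in this part on a small, made-up `pets` DataFrame. The DataFrame and its column names are assumptions for illustration only and are not part of the lab data.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the lab dataset
pets = pd.DataFrame({
    "pet": ["Rex", "Milo", "Luna", "Coco"],
    "species": ["dog", "cat", "cat", "dog"],
    "weight_kg": [30.0, 4.5, 5.0, 12.5],
})

print(pets.head(2))                    # first 2 rows (head() defaults to 5)
print(pets.dtypes)                     # data type of each column
print(pets.columns)                    # column headings
print(pets.shape)                      # (number of rows, number of columns)
print(pets.ndim)                       # number of axes; 2 for any DataFrame
print(pets.describe())                 # summary statistics for numeric columns
print(pets["species"].value_counts())  # how often each value occurs
print(pets["species"].unique())        # the distinct values in a column
```

Note that `shape` and `ndim` are attributes, so they take no parentheses, while `head()`, `describe()`, `value_counts()`, and `unique()` are methods.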

**About the dataset:** The dataset is a small example dataset containing information about 5 individuals, including their name, age, gender, country of residence, and salary. 

**Run the code below to create a dictionary of sample data.**

In [None]:
# Import pandas
import pandas as pd

# Create a dictionary of sample data
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'Age': [25, 30, 35, 40, 45],
        'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],
        'Country': ['USA', 'Canada', 'UK', 'Australia', 'USA'],
        'Salary': [50000, 60000, 70000, 80000, 90000]}

# Create a Pandas dataframe from the dictionary
df = pd.DataFrame(data)

### **Exercise #1:** Print the first 5 rows of the dataframe.


In [None]:
# Print the first 5 rows of the dataframe


### **Exercise #2:** Print the column headers and data types of the dataframe.

In [None]:
# Print the column headers and data types of the dataframe


### **Exercise #3:** Print the column headings.

In [None]:
# Print the column headings
print(df.columns)

### **Exercise #4:** What is the shape (number of rows and columns) of the dataframe?


In [None]:
# Print the shape of the dataframe


### **Exercise #5:** How many dimensions (axes) does the dataframe have?

In [None]:
# Print the number of dimensions of the dataframe


### **Exercise #6:** Print basic statistical data for the dataset (mean, standard deviation, etc).

### **Exercise #7:** How many times does each unique value appear in the 'Gender' column?


In [None]:
# Print counts of unique values in the 'Gender' column
print(df['Gender'].value_counts())

### **Exercise #8:** Print all unique values in the 'Country' column.


In [None]:
# Unique values in 'Country'

---

<center>

#### **Back to lecture**

</center>

---

<a name="p2"></a>
## **Part 2: Data Wrangling Practice**
---
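Before starting, here is a minimal sketch of the missing-data tools this part relies on. It uses a made-up `books` DataFrame (the names and values are assumptions for illustration, not lab data) to show how to find missing values, drop incomplete rows, filter rows by a condition, and add a column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "pages": [120, np.nan, 300, 450],
    "genre": ["poetry", "novel", np.nan, "novel"],
})

missing_mask = books.isnull()              # True wherever a value is missing
missing_per_column = missing_mask.sum()    # number of missing values per column

complete_rows = books.dropna()             # new frame without rows that contain NaN
long_books = books[books["pages"] > 200]   # boolean filter; NaN compares as False

books["price"] = [5.0, 7.5, 9.0, 12.0]     # a new column needs one value per row
```

`dropna()` and the boolean filter both return new DataFrames and leave `books` unchanged, so assign the result to a variable if you want to keep it.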

Use the DataFrame below on NBA basketball players to answer Exercises #1-4. Take a moment to explore the dataframe created in the cell below, which contains the names of famous NBA players along with their ages, heights, and teams.

**Remember to run the cell below to load the DataFrame before continuing on to the exercises.**

In [None]:
# Import NumPy (for np.nan) and pandas
import numpy as np
import pandas as pd

# create dataframe
df = pd.DataFrame(
  {'Name':['Giannis Antetokounmpo','Kevin Durant','Stephen Curry','Nikola Jokic', 'Joel Embiid'],
  'Age':[28, 34, 34, 27, np.nan],
  'Height (in)':[83, 82, 74, 83, np.nan],
  'Team':['Milwaukee Bucks', np.nan, 'Golden State Warriors', 'Denver Nuggets', np.nan] })
df

### **Exercise #1:** Use `isnull()` to see which values are missing.

### **Exercise #2:** Since Joel Embiid's data is missing, drop the row. 

**Question:** What are the variables that we're working with in the above dataset?

### **Exercise #3:** Create a new dataframe that only includes rows with players who are over 30 years old.

### **Exercise #4:** Add a new column called "Salary" with values `[$35,000, $42,000, $38,000, $40,000, $28,000]` and display the updated dataframe.

**New Dataset Alert!**

The following students have applied for an on-campus university job as a Research Assistant. However, some data is missing. It is your job to fix the missing data. Use this data to answer Exercises #5-8.

**Remember to run the cell below to load the DataFrame before continuing on to the exercises.**

In [None]:
# create dataframe
students_df = pd.DataFrame(
  {'name':['Jen','Akiro','Jamil','Benny', 'Aster', 'Raj', 'Alisha'],
  'age':[19, 18, 21, 23, 26, np.nan, 30],
  'gpa':[np.nan, 4.0, 3.0, 2.3, np.nan, 3.9, 3.8],
  'year':['Freshman', 'Freshman', 'Junior', 'Junior', 'Senior', 'Sophomore', 'Senior'] })
students_df

### **Exercise #5:** Raj's age is missing. Replace the missing value with the mean of the non-missing values in the `age` column.

### **Exercise #6:** Jen's and Aster's GPAs are missing. Replace these missing values with the median of `gpa`.

### **Exercise #7:** Use the `students_df` DataFrame and rename `gpa` to `GPA`. 

### **Exercise #8:** Rename the rest of the columns so that all column names begin with a capital letter.
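
For reference, filling missing values and renaming columns can be sketched as follows. The `scores` DataFrame is a made-up example (its names and numbers are assumptions, not lab data), so the same pattern still has to be adapted to `students_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
scores = pd.DataFrame({
    "player": ["P1", "P2", "P3", "P4"],
    "points": [10.0, np.nan, 20.0, 30.0],
    "rating": [1.0, 2.0, np.nan, 9.0],
})

# mean() and median() skip NaN by default, so they use only the known values
scores["points"] = scores["points"].fillna(scores["points"].mean())
scores["rating"] = scores["rating"].fillna(scores["rating"].median())

# rename() returns a new frame; pass a dict to rename selected columns
renamed = scores.rename(columns={"points": "Points"})

# or pass a function to transform every column name at once
capitalized = scores.rename(columns=str.capitalize)
```

Keep in mind that `str.capitalize` lowercases everything after the first letter, so it would turn `gpa` into `Gpa`, not `GPA`. A dictionary gives you exact control over each new name.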

<a name="p3"></a>
## [OPTIONAL] **Part #3: Data Cleaning Practice**
---
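Data cleaning often means dropping columns or rows that carry no usable information. The sketch below shows a few ways to do that on a made-up `shops` DataFrame; the names and values are assumptions for illustration and differ from the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
shops = pd.DataFrame({
    "shop": ["S1", "S2", "S3", "S4"],
    "revenue": [10.0, np.nan, np.nan, 4.0],
    "owner": [np.nan, np.nan, np.nan, np.nan],
})

no_owner = shops.drop(columns=["owner"])         # drop a column by name
no_empty_cols = shops.dropna(axis=1, how="all")  # drop columns that are entirely NaN
some_rows = no_owner.drop(index=[1, 2])          # drop rows by index label
complete = no_owner.dropna()                     # drop every row containing a NaN
```

Each call returns a new DataFrame, so `shops` itself still has all four rows and three columns afterwards.
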
#### **Exercise #1: Remove the missing values from the following DataFrame:**


In [None]:
# create dataframe
df = pd.DataFrame(
  {'Company':['Google','Amazon','Infosys','Directi'],
  'Age':['21','23','38','22'],
  'NetWorth ($ bn)':[300, np.nan, np.nan, 1.3],
  'Founder':[np.nan, np.nan, np.nan, np.nan],
  'Headquarter-Country':['United States', np.nan, 'India', 'India'] })
df

In [None]:
# Remove the 'Founder' column (feature)

# Remove rows 1 and 2



<a name="p4"></a>
## [Additional Practice] **Part #4: Exploratory Data Analysis (EDA) using the Gapminder dataset**
---

## **Part 4.1: EDA**
This part focuses on Exploratory Data Analysis (EDA) using the Gapminder dataset, which contains socio-economic indicators for different countries over time.

First, we need to install the gapminder package to access the dataset. **Run the next cell to install the packages and create the dataframe.**

In [None]:
!pip install gapminder

import pandas as pd
from gapminder import gapminder

# Create dataframe
gapminder_df = pd.DataFrame(gapminder)
gapminder_df.head()
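
The exercises below mostly combine two steps: filter the rows you care about, then aggregate one column. Here is a minimal sketch of that pattern on a made-up `toy` DataFrame. It borrows Gapminder's column names (`country`, `continent`, `year`, `lifeExp`, `pop`, `gdpPercap`), but the countries and numbers are invented for illustration.

```python
import pandas as pd

# Made-up rows that only borrow Gapminder's column names
toy = pd.DataFrame({
    "country":   ["Aland", "Aland", "Borduria", "Borduria"],
    "continent": ["North", "North", "South", "South"],
    "year":      [2000, 2005, 2000, 2005],
    "lifeExp":   [60.0, 62.0, 70.0, 74.0],
    "pop":       [100, 110, 200, 220],
    "gdpPercap": [1000.0, 1200.0, 3000.0, 3500.0],
})

n_countries = toy["country"].nunique()                # number of distinct values
year_range = (toy["year"].min(), toy["year"].max())   # earliest and latest year

rows_2005 = toy[toy["year"] == 2005]                  # filter on one condition
mean_life_2005 = rows_2005["lifeExp"].mean()
median_gdp_2005 = rows_2005["gdpPercap"].median()
pop_2005 = rows_2005["pop"].sum()

# Combine conditions with & and wrap each one in parentheses
one_row = toy[(toy["country"] == "Aland") & (toy["year"] == 2000)]

# idxmax() gives the index label of the largest value; loc fetches that row
richest = toy.loc[toy["gdpPercap"].idxmax(), ["country", "gdpPercap"]]
```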

### **Exercise #1:** How many countries are there in the dataset?

In [None]:
unique_countries = # YOUR CODE HERE #

print(f"There are {unique_countries} countries in the dataset.")

### **Exercise #2:** What is the time range?

In [None]:
time_range = # YOUR CODE HERE #

print(f"The time range is from {time_range[0]} to {time_range[1]}.")

### **Exercise #3:**  What is the average life expectancy for Japan in 2007?

### **Exercise #4:** Print the entire row for the United States in 1962.

### **Exercise #5:** What is the global mean life expectancy in 1997?

### **Exercise #6:** What is the median GDP per capita globally in 1982?

### **Exercise #7:** What is the total population of Asia in 1957?

### [OPTIONAL] **Exercise #8:** What is the correlation between GDP per capita and life expectancy globally in 2007?

### [OPTIONAL] **Exercise #9:**  What is the correlation between population and GDP per capita in Asia in 2002?

### [OPTIONAL] **Exercise #10:** Which country had the highest GDP per capita in 1952, and what was its value?

## **Part 4.2: Insights from the EDA on the Gapminder dataset.**
---
The analysis of the Gapminder dataset provides valuable insights into various aspects of the socio-economic conditions of countries across different parts of the world. We learned that the dataset includes 142 unique countries and covers a time range from 1952 to 2007. We investigated correlations between different variables, such as GDP per capita and life expectancy, and found that in 2007, there was a positive correlation between the two variables globally, indicating that higher GDP per capita was generally associated with higher life expectancy.
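
To make the idea of a correlation concrete, the sketch below computes one on four invented data points. The numbers are assumptions chosen only to show the mechanics and are not taken from Gapminder.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "gdpPercap": [1000.0, 2000.0, 3000.0, 4000.0],
    "lifeExp":   [50.0, 58.0, 61.0, 70.0],
})

# Series.corr() computes the Pearson correlation coefficient by default
r = toy["gdpPercap"].corr(toy["lifeExp"])
print(round(r, 3))

# DataFrame.corr() returns the full matrix of pairwise correlations
print(toy.corr())
```

The coefficient ranges from -1 to +1. A value near +1 means the two columns tend to rise together, but it does not show that one causes the other.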

**Reflection question: How can the insights gained from analyzing the Gapminder dataset inform decision-making processes in the future?**

**Enter your answer in the code cell below.**

In [None]:
"""

WRITE YOUR ANSWER HERE

""";

**Congratulations on finishing the notebook!** In this lab, you practiced the `Pandas` commands you learned yesterday, worked with rows and columns, and used the mean, median, and sum to gain further insights into the data. You also practiced cleaning diverse datasets by removing or replacing missing and incorrect values.

---
© 2023 The Coding School, All rights reserved