<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="380" alt="Skills Network Logo">
    </a>
</p>


# **Exploratory Data Analysis Lab**


Estimated time needed: **30** minutes


In this module you work with the cleaned dataset from the previous module.

In this assignment you will perform exploratory data analysis.
You will examine how the data is distributed, check for outliers, and determine the correlation between columns in the dataset.


## Objectives


In this lab you will perform the following:


-   Identify the distribution of data in the dataset.

-   Identify outliers in the dataset.

-   Remove outliers from the dataset.

-   Identify correlation between features in the dataset.


* * *


## Hands-on Lab


Import the pandas, NumPy, and Matplotlib modules.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load the dataset into a dataframe.


<h2>Read Data</h2>
<p>
We utilize the <code>pandas.read_csv()</code> function for reading CSV files. However, in this version of the lab, which operates on JupyterLite, the dataset needs to be downloaded to the interface using the provided code below.
</p>


The functions below will download the dataset into your browser:


In [None]:
from pyodide.http import pyfetch

async def download(url, filename):
    response = await pyfetch(url)
    if response.status == 200:
        with open(filename, "wb") as f:
            f.write(await response.bytes())

In [None]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m2_survey_data.csv"

To obtain the dataset, call the `download()` function defined above:


In [None]:
await download(file_path, "m2_survey_data.csv")
file_name="m2_survey_data.csv"

Use the pandas function `read_csv()` to load the data into a dataframe.


In [None]:
df = pd.read_csv(file_name)

> Note: This version of the lab runs on JupyterLite, which requires the dataset to be downloaded to the interface. If you are working with a downloaded copy of this notebook on your local machine (Jupyter Anaconda), you can **skip the steps above** and pass the URL directly to the `pandas.read_csv()` function. To do so, uncomment and run the statement in the cell below.


In [None]:
#df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m2_survey_data.csv")

## Distribution


### Determine how the data is distributed


The column `ConvertedComp` contains each respondent's salary converted to an annual USD salary using the exchange rate on 2019-02-01.

The conversion assumes 12 working months and 50 working weeks per year.
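The annualization assumption can be illustrated with a small arithmetic sketch. The helper `annualize_usd`, the frequency labels, and the exchange rate below are hypothetical and only show how a monthly or weekly amount would scale under the stated 12-month and 50-week assumption; they are not the survey's actual conversion code.

```python
# Hypothetical illustration of the 12-month / 50-week annualization assumption.
PERIODS_PER_YEAR = {"Yearly": 1, "Monthly": 12, "Weekly": 50}

def annualize_usd(amount, frequency, usd_per_unit=1.0):
    """Scale a per-period salary to a yearly figure and convert it to USD."""
    return amount * PERIODS_PER_YEAR[frequency] * usd_per_unit

monthly_example = annualize_usd(5000, "Monthly")        # 5000 * 12
weekly_example = annualize_usd(1000, "Weekly", 1.5)     # 1000 * 50 * 1.5 (made-up rate)
print(monthly_example, weekly_example)
```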


Plot the distribution curve for the column `ConvertedComp`.


In [None]:
# your code goes here

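# A density (KDE) plot draws the distribution curve; pandas needs SciPy for this plot kind
df['ConvertedComp'].dropna().plot(kind='density', title='ConvertedComp Distribution Curve')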

Plot the histogram for the column `ConvertedComp`.
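As a rough sketch of what a histogram computes, the example below bins a few made-up salary values with `numpy.histogram`. The values and the choice of four bins are assumptions for illustration only and do not come from the survey data.

```python
import numpy as np

# Toy salaries (made up); one large value mimics a long right tail.
salaries = np.array([20_000, 35_000, 40_000, 52_000, 58_000, 61_000, 75_000, 180_000])

# Split the range [min, max] into 4 equal-width bins and count values per bin.
counts, edges = np.histogram(salaries, bins=4)
print(counts)
print(edges)
```

Most values fall into the first bin and the last bin holds the single large value, which is the kind of right-skewed shape a salary histogram often shows.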


In [None]:
# your code goes here
df['ConvertedComp'].plot(kind = 'hist', title = "Annual USD Salaries")


What is the median of the column `ConvertedComp`?


In [None]:
# your code goes here
df['ConvertedComp'].median()


How many respondents identified themselves only as a **Man**?
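The word "only" matters if a column can hold several selections in one cell. The sketch below uses a made-up series and assumes, purely for illustration, that multiple selections are joined with a semicolon; it contrasts an exact match with a substring match.

```python
import pandas as pd

# Toy data; the semicolon-joined value is an assumed format for multi-select answers.
gender = pd.Series(["Man", "Woman", "Man;Non-binary", "Man", None])

# Exact equality counts rows whose value is exactly "Man".
only_man = int((gender == "Man").sum())

# A substring match also counts rows that list "Man" alongside other selections.
mentions_man = int(gender.str.contains("Man", na=False).sum())

print(only_man, mentions_man)
```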


In [None]:
# your code goes here
df['Gender'].value_counts()['Man']


Find the median `ConvertedComp` of respondents who identified themselves only as a **Woman**.


In [None]:
# your code goes here
women_df = df[df['Gender'] == 'Woman']
median_converted_comp = women_df['ConvertedComp'].median()
median_converted_comp


Give the five-number summary for the column `Age`.


**Double click here for hint**.

<!--
min,q1,median,q3,max of a column are its five number summary.
-->
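A minimal sketch of a five-number summary on toy data, assuming pandas' default linear interpolation for quantiles. The values in `ages` are made up for illustration.

```python
import pandas as pd

ages = pd.Series([22, 25, 29, 31, 35, 40, 58])  # toy data, not from the survey

# Quantiles 0 and 1 are the minimum and maximum; 0.5 is the median.
five_number = ages.quantile([0, 0.25, 0.5, 0.75, 1.0])
five_number.index = ["min", "q1", "median", "q3", "max"]
print(five_number)
```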


In [None]:
# your code goes here
# Use descriptive names so the built-in min() and max() are not shadowed
age_min = df.Age.min()
age_q1 = df.Age.quantile(0.25)
age_median = df.Age.median()
age_q3 = df.Age.quantile(0.75)
age_max = df.Age.max()

age_min, age_q1, age_median, age_q3, age_max

Plot a histogram of the column `Age`.


In [None]:
# your code goes here
df['Age'].plot(kind='hist',title= "Age")


## Outliers


### Finding outliers


Use a box plot to find out whether outliers exist in the column `ConvertedComp`.


In [None]:
# your code goes here
df['ConvertedComp'].plot.box()


Find the interquartile range (IQR) for the column `ConvertedComp`.


In [None]:
# your code goes here
Q1 = df.ConvertedComp.quantile(0.25)
Q3 = df.ConvertedComp.quantile(0.75)
# The interquartile range is the distance between the third and first quartiles
IQR = Q3 - Q1
IQR

Find out the upper and lower bounds.


In [None]:
# your code goes here
IQR = Q3 - Q1
# Values more than 1.5 * IQR below Q1 or above Q3 are treated as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
lower_bound, upper_bound

Identify how many outliers there are in the `ConvertedComp` column.


In [None]:
# your code goes here
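outliers = df[(df.ConvertedComp < lower_bound) | (df.ConvertedComp > upper_bound)]
# The number of rows in the filtered dataframe is the number of outliers
len(outliers)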


Create a new dataframe by removing the outliers from the `ConvertedComp` column.
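The outlier steps in this section (quartiles, IQR, bounds, count, removal) can be sketched end to end on toy data. The values below are made up and the 1.5 multiplier is the conventional choice already used in the cells above; this is an illustration, not the expected lab output.

```python
import pandas as pd

values = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])  # toy data with one extreme value

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds at 1.5 * IQR beyond the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Boolean mask marking values outside the bounds
is_outlier = (values < lower_bound) | (values > upper_bound)
n_outliers = int(is_outlier.sum())

# Keep only the values inside the bounds
trimmed = values[~is_outlier]
print(n_outliers, len(trimmed))
```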


In [None]:
# your code goes here
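# Filter on ConvertedComp (not Age); rows with a missing ConvertedComp are dropped as well
df_no_outliers = df[(df['ConvertedComp'] >= lower_bound) & (df['ConvertedComp'] <= upper_bound)]
df_no_outliers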

## Correlation


### Finding correlation


Find the correlation between `Age` and all other numerical columns.
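A minimal sketch of the idea on a toy dataframe: keep the numeric columns, compute the correlation matrix, and read off the `Age` column. The column names other than `Age` and all values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [22, 30, 38, 46, 54],
    "YearsCode": [1, 8, 15, 24, 30],          # hypothetical numeric column
    "Country": ["IN", "US", "DE", "BR", "UK"],  # non-numeric, excluded below
})

# Select numeric columns only, then take Age's row of Pearson correlations.
numeric_cols = toy.select_dtypes(include="number")
age_corr = numeric_cols.corr()["Age"].drop("Age")  # drop Age's correlation with itself
print(age_corr)
```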


In [None]:
# your code goes here
# Restrict to numeric columns, then read off the Age column of the correlation matrix
numeric_df = df.select_dtypes(include='number')
correlations = numeric_df.corr()['Age']
correlations

## Authors


Ramesh Sannareddy


### Other Contributors


Rav Ahuja


Copyright © 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license).


<!--
## Change Log

| Date (YYYY-MM-DD) | Version | Changed By        | Change Description                 |
| ----------------- | ------- | ----------------- | ---------------------------------- |
| 2020-10-17        | 0.1     | Ramesh Sannareddy | Created initial version of the lab |
-->
