---   

<h1 align="center">Introduction to Data Analyst and Data Science for beginners</h1>
<h1 align="center">Lecture no 2.10(Pandas-01)</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

## _From Python Dictionary to Pandas Dataframe.ipynb_

## 1. Overview of Pandas Library and its Data Structures

<img align="center" width="700" height="700"  src="images/pandas-apps.png"  >

> **Pandas is an open-source Python library built on top of NumPy that provides easy-to-use data structures and data analysis tools. Its name is derived from "panel data", and its development was started by Wes McKinney in 2008.**

> **Data scientists use pandas for a variety of data science tasks, starting with downloading, opening, reading, and writing files in different formats such as CSV, Excel, JSON, and HTML. They load the dataset into a pandas data structure called a DataFrame.**

> **A pandas DataFrame is a 2-dimensional labeled data structure (like a SQL table) with heterogeneously typed columns, having both a row index and a column index.**

> **After the data is loaded into a DataFrame, data scientists perform various data manipulation tasks such as filtering and modifying data based on multiple conditions, as well as cutting, splitting, merging, sorting, scaling, pivoting, and aggregating data.**

> **Data cleaning enhances data accuracy and integrity by identifying and removing null values, duplicates, and outliers.**

> **Data wrangling transforms the data structurally into an appropriate format and makes it ready for machine learning engineers, who can then apply appropriate machine learning models or algorithms to the dataset for training, validation, and testing.**
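
The tasks described above can be sketched on a tiny table. This is only an illustration: the records below are made up, and a real workflow would load them from a file instead.

```python
import pandas as pd

# Hypothetical toy records; the names and values are made up for illustration.
raw = pd.DataFrame({
    "name": ["Ali", "Ayesha", "Ayesha", "Dua", "Adeen"],
    "age": [20, 18, 18, None, 10],
    "city": ["Karachi", "Lahore", "Lahore", "Islamabad", "Karachi"],
})

# Heterogeneously typed columns: each column carries its own dtype.
print(raw.dtypes)

# Data cleaning: drop exact duplicate rows, then rows with null values.
clean = raw.drop_duplicates().dropna()

# Data manipulation: filter on a condition and sort the result.
adults = clean[clean["age"] >= 18].sort_values("age")
print(adults)
```

Chaining `drop_duplicates()` and `dropna()` keeps the original `raw` table untouched, which makes it easy to compare the data before and after cleaning.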

## Learning agenda of this notebook

1. Overview of Pandas Library and its Data Structures
2. Install Pandas Library
3. Read Datasets into Pandas Dataframe
4. Python Dictionaries vs Pandas Dataframes
5. Anatomy of a Dataframe
6. Bonus

## 2. Install Pandas Library

In [None]:
# To install this library in Jupyter notebook
import sys
!{sys.executable} -m pip install pandas --quiet

In [None]:
import pandas as pd
pd.__version__ , pd.__path__

## 3. Read Data Sets into Pandas Dataframe

### a. Titanic Dataset
- Here is link to [titanic movie](https://hdtoday.cc/movie/watch-titanic-full-19586)
- This dataset contains information about the passengers aboard the Titanic, which sank in the North Atlantic Ocean in 1912 after striking an iceberg.
- Characteristics of dataset
   - pclass : passenger class (1st, 2nd, or 3rd class)
   - survived : 1 means alive and 0 means died.
   - name : name of each passenger
   - sex : gender
   - age : passenger's age
   - sibsp : number of siblings and spouses (husband or wife) on board
   - parch : number of parents and children on board
   - cabin : cabin in which the passenger stayed
   - embarked : port where the passenger embarked, such as Southampton, Cherbourg, or Queenstown.
   
- This is usually the first problem that machine learning students work on: they answer questions such as what sort of people were more likely to survive the Titanic sinking by applying a predictive model such as regression.

In [None]:
df_titanic = pd.read_csv('datasets/titanic3.csv')
df_titanic

### b. IMDB Dataset
- This is the Internet Movie Database (IMDb) dataset containing movie ratings. Any registered IMDb user can cast a vote on any released title.
- Using this dataset, we can answer many questions, such as which movies feature a specific actor, which movies are comedies or crime movies, which movies are not suitable for children, and in which year a movie was released.

In [None]:
df_imdb = pd.read_csv('datasets/imdb.csv')
df_imdb

### c. Covid Dataset
- This COVID-19 dataset is collected by the WHO and contains daily updates on different statistics of the disease in countries all around the globe.

In [None]:
df_covid = pd.read_csv('datasets/covid-data.csv')
df_covid

**We have now seen the idea of a pandas DataFrame: a data structure capable of loading a large amount of data into computer memory. It lets you wrangle and clean the data, perform different kinds of data analytics, and finally use the data to design and train machine learning models for prediction and forecasting.**
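
The three cells above follow the same loading pattern. As a minimal sketch that runs without the `datasets/` folder, the example below reads a few hypothetical CSV lines from an in-memory string (the passenger rows are invented) and inspects what was loaded.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file such as datasets/titanic3.csv.
csv_text = """pclass,survived,name,age
1,1,Passenger A,29
3,0,Passenger B,
2,1,Passenger C,41
"""

# read_csv accepts a file path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                           # (rows, columns)
print(df.dtypes)                          # dtype inferred for each column
print(df.memory_usage(deep=True).sum())   # bytes held in memory
```

Checking `shape`, `dtypes`, and memory usage right after loading is a quick way to confirm that the file was parsed as expected before any analysis starts.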

## 4. Python Dictionaries vs Pandas Dataframes

In [None]:
person = {
    "name" : "Ehtisham",
    "age" : 21,
    "address" : "Lahore",
    "cell" : "0320-431",
    "bg": "A-"
}
person

In [None]:
people = {
    "name" : ["Ehtisham", "Ali", "Ayesha", "Dua", "Khubaib", "Adeen"],
    "age" : [21, 20, 18, 17, 12, 10],
    "address": ["Lahore", "Karachi", "Lahore", "Islamabad", "Kakul", "Karachi"],
    "cell" : ["321-123", "320-431", "321-478", "324-446", "321-967", "320-678"],
    "bg": ["B+", "A-", "B+", "O-", "A-", "B+"]
}
people

In [None]:
! cat datasets/people.csv

In [None]:
import pandas as pd
df_people = pd.read_csv('datasets/people.csv')
df_people

**Accessing Elements of Dictionaries and Dataframes**

In [None]:
people

In [None]:
# First Method
people.get('age')

In [None]:
# Second Method
mylist = people['name']
mylist

In [None]:
type(mylist)

In [None]:
df_people[' age']

In [None]:
myseries = df_people['name']
myseries

In [None]:
type(myseries)

In [None]:
df_people.index

In [None]:
df_people.columns
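
The cells above read the DataFrame from a CSV file. A dictionary of lists can also be converted directly, as in the sketch below, which uses a shortened copy of the `people` dictionary (the variable name `df_from_dict` is introduced here only for illustration).

```python
import pandas as pd

# A shortened version of the `people` dictionary from this section.
people = {
    "name": ["Ehtisham", "Ali", "Ayesha"],
    "age": [21, 20, 18],
    "address": ["Lahore", "Karachi", "Lahore"],
}

# Each key becomes a column label; each list becomes that column's values.
df_from_dict = pd.DataFrame(people)

# A dictionary lookup returns a plain list ...
print(type(people["age"]))
# ... while a DataFrame lookup returns a labeled Series.
print(type(df_from_dict["age"]))

# The Series supports vectorized operations that a list does not.
print(df_from_dict["age"].mean())
```

All lists in the dictionary must have the same length; otherwise `pd.DataFrame` raises a `ValueError`.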

## 5. Anatomy of a Dataframe
<img align="center" width="800" height="500"  src="images/dataframe.png"  >

**Each column of this DataFrame is a Series object. Because the DataFrame is two-dimensional, it has two axes. Axis 0 is the vertical axis that runs from top to bottom, and the records (rows) lie along it. Axis 1 is the horizontal axis that runs from left to right, and the columns change as we move along it. This is the same concept we discussed for NumPy 2-D arrays. At the intersection of each row and column sits a data value.**
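
A small sketch may make the two axes concrete. The `marks` table below is hypothetical and exists only to show how `axis=0` and `axis=1` change the direction of an operation.

```python
import pandas as pd

# Hypothetical marks table used only to illustrate the two axes.
marks = pd.DataFrame(
    {"math": [80, 65, 90], "physics": [70, 85, 60]},
    index=["Ali", "Ayesha", "Dua"],
)

# axis=0 runs down the rows, so the result has one value per column.
column_totals = marks.sum(axis=0)

# axis=1 runs across the columns, so the result has one value per row.
row_totals = marks.sum(axis=1)

# A single data value sits at the intersection of a row label and a column label.
value = marks.loc["Ayesha", "physics"]

print(column_totals, row_totals, value, sep="\n\n")
```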

## 6. Bonus (Changing display properties of a Dataframe Object)
- Get the option names `pd.describe_option()`
- Get current value of a display option `pd.get_option('nameofoption')`
- Change value of a display option `pd.set_option('nameofoption', newvalue)`
- Resetting options to default `pd.reset_option('all')`
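
When a setting is needed only for one cell, `pd.option_context` can be used in place of a `set_option` / `reset_option` pair. The sketch below is one possible pattern, not the only one; it stores the current value first so that it does not assume any particular default.

```python
import pandas as pd

# Remember the current value so the comparison below does not assume a default.
before = pd.get_option("display.max_columns")

# option_context applies the settings only inside the `with` block.
with pd.option_context("display.max_columns", 50, "display.max_colwidth", 200):
    inside = pd.get_option("display.max_columns")

# On leaving the block, the previous value is restored automatically.
after = pd.get_option("display.max_columns")
print(before, inside, after)
```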

In [None]:
df_covid

### a. Changing the number of columns to be displayed

In [None]:
pd.describe_option()

In [None]:
pd.get_option('display.max_columns')

In [None]:
pd.set_option('display.max_columns', 50)

In [None]:
df_covid

### b. Changing the number of rows to be displayed

**Display 8 rows instead of the default of 10**

In [None]:
pd.get_option('display.min_rows')

In [None]:
pd.set_option('display.min_rows', 8)

**Display 30 rows instead of default of 10**

In [None]:
pd.set_option('display.min_rows', 30)

In [None]:
df_covid

>Students should explore the relationship between the `max_rows` and `min_rows` options on their own.
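
As a starting point for that exploration, the sketch below builds a hypothetical 100-row frame and prints it under two settings. The values 1000 to 1099 are chosen only because they are easy to spot in the output.

```python
import pandas as pd

# Hypothetical frame with 100 rows holding the values 1000..1099.
df = pd.DataFrame({"value": range(1000, 1100)})

# More rows than max_rows: the output is truncated to min_rows rows.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    truncated = repr(df)

# Fewer rows than max_rows: every row is shown and min_rows has no effect.
with pd.option_context("display.max_rows", 200, "display.min_rows", 10):
    full = repr(df)

print(truncated)
print(len(full.splitlines()))
```

In short, `max_rows` decides whether the output is truncated, and `min_rows` decides how many rows remain visible once it is.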

### c. Changing Number of Characters to be Displayed in each Column

In [None]:
df_imdb

In [None]:
pd.get_option('display.max_colwidth')

In [None]:
pd.set_option('display.max_colwidth', 200)

In [None]:
pd.get_option('display.max_colwidth')

In [None]:
df_imdb

### d. Setting the options back to default

In [None]:
pd.reset_option('all')

### e. Changing Style by applying CSS to Pandas Dataframe

In [None]:
df_titanic.head().style.set_table_styles(
[{'selector': 'th',
  'props': [('background', '#7CAE00'), 
            ('color', 'white'),
            ('font-family', 'verdana')]},
 
 {'selector': 'td',
  'props': [('font-family', 'verdana')]},

 {'selector': 'tr:nth-of-type(odd)',
  'props': [('background', '#DCDCDC')]}, 
 
 {'selector': 'tr:nth-of-type(even)',
  'props': [('background', 'white')]},
 
 {'selector': 'tr:hover',
  'props': [('background', 'pink')]},
 
]
).hide(axis='index')  # hide_index() was removed in pandas 2.0; on pandas older than 1.4 use .hide_index()

## Project : Visualizing Earnings Based On College Majors

### 1. Introduction
* Pandas has many methods for quickly generating common plots from data in DataFrames. Like pyplot, the plotting functionality in pandas is a wrapper for matplotlib. This means we can customize the plots when necessary by accessing the underlying Figure, Axes, and other matplotlib objects.

* In this  project, we'll explore how using the pandas plotting functionality along with the Jupyter notebook interface allows us to explore data quickly using visualizations.

* We'll be working with a dataset on the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by [American Community Survey](https://www.census.gov/programs-surveys/acs/), which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their [Github repo](https://github.com/fivethirtyeight/data/tree/master/college-majors).

Here are some of the columns in the dataset:

* Rank - Rank by median earnings (the dataset is ordered by this column).
* Major_code - Major code.
* Major - Major description.
* Major_category - Category of major.
* Total - Total number of people with major.
* Sample_size - Sample size (unweighted) of full-time.
* Men - Male graduates.
* Women - Female graduates.
* ShareWomen - Women as share of total.
* Employed - Number employed.
* Median - Median salary of full-time, year-round workers.
* Low_wage_jobs - Number in low-wage service jobs.
* Full_time - Number employed 35 hours or more.
* Part_time - Number employed less than 35 hours.

Using visualizations, we can start to explore questions from the dataset like:

* Do students in more popular majors make more money?
  * Using scatter plots
* How many majors are predominantly male? Predominantly female?
  * Using histograms
* Which category of majors has the most students?
  * Using bar plots

## TODO:
* Let's set up the environment by importing the libraries we need and running the necessary Jupyter magic so that plots are displayed inline.

  * Import `pandas` and `matplotlib` into the environment.
  * Run the Jupyter magic `%matplotlib inline` so that plots are displayed inline.
  * Read the dataset into a DataFrame and start exploring the data.

* Read `recent-grads.csv` into pandas and assign the resulting DataFrame to recent_grads.
  * Use `DataFrame.iloc[]` to return the first row formatted as a table.
  * Use `DataFrame.head()` and `DataFrame.tail()` to become familiar with how the data is structured.
  * Use `DataFrame.describe()` to generate summary statistics for all of the numeric columns.

* Drop rows with missing values. Matplotlib expects that columns of values we pass in have matching lengths and missing values will cause matplotlib to throw errors.

  * Look up the number of rows in `recent_grads` and assign the value to `raw_data_count`.
  * Use `DataFrame.dropna()` to drop rows containing missing values and assign the resulting DataFrame back to `recent_grads`.
  * Look up the number of rows in `recent_grads` now and assign the value to `cleaned_data_count`. If you compare cleaned_data_count and raw_data_count, you'll notice that only one row contained missing values and was dropped.
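
Two of the steps above are easy to mix up: `iloc[0]` returns the first row as a Series, while `iloc[[0]]` keeps it as a one-row table. The sketch below shows both, followed by the row counts before and after `dropna()`. The `grads` frame is a hypothetical stand-in for `recent_grads` with one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for recent_grads with one missing value.
grads = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Median": [50000, 40000, 30000],
})

first_as_series = grads.iloc[0]    # one row as a Series (printed vertically)
first_as_table = grads.iloc[[0]]   # a list of positions keeps the DataFrame shape

raw_data_count = len(grads)
grads = grads.dropna()             # dropna returns a new DataFrame; assign it back
cleaned_data_count = len(grads)

print(raw_data_count, cleaned_data_count)
```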

In [None]:
# import necessary libraries
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
# read the dataset into dataframe
url = "https://raw.githubusercontent.com/AnshuTrivedi/Data-Scientist-In-Python/master/Projects/step_2/Course_2/recent-grads.csv"
recent_grads = pd.read_csv(url)
recent_grads.head()

In [None]:
recent_grads.tail()

In [None]:
recent_grads.dtypes

In [None]:
recent_grads.head(1)

In [None]:
# returns the first row of the dataframe
recent_grads.iloc[0,:]

In [None]:
# generate descriptive summary of the data
recent_grads.describe()

In [None]:
# check the columns which contains missing values
recent_grads.isna().sum()

In [None]:
recent_grads.shape

In [None]:
# check the count of rows in the dataset
raw_data_count = recent_grads.shape[0]
raw_data_count

In [None]:
recent_grads.dropna()

In [None]:
# drop all the rows which contain missing data using the dropna() method
recent_grads=recent_grads.dropna()

In [None]:
cleaned_data_count = recent_grads.shape[0]
print("No. of rows in raw data: ", raw_data_count)
print("No. of rows in cleaned data : ", cleaned_data_count)

### 2. Pandas, Scatter Plots
* Most of the plotting functionality in pandas is contained within the `DataFrame.plot()` method. When we call this method, we specify the data we want plotted as well as the type of plot. We use the `kind` parameter to specify the type of plot we want. We use `x` and `y` to specify the data we want on each axis.

`recent_grads.plot(x='Sample_size', y='Employed', kind='scatter', title='Employed vs. Sample_size', figsize=(5,10))`

We can access the underlying matplotlib Axes object by assigning the return value to a variable:

`ax = recent_grads.plot(x='Sample_size', y='Employed', kind='scatter')`

`ax.set_title('Employed vs. Sample_size')`
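
Once the Axes object is captured, any matplotlib Axes method can be applied to it. The sketch below uses hypothetical numbers with the same column names as `recent_grads`. The `Agg` backend line is an assumption that lets the code run outside Jupyter; omit it inside a notebook that already uses `%matplotlib inline`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import pandas as pd

# Hypothetical values; only the column names mirror recent_grads.
df = pd.DataFrame({"Sample_size": [10, 50, 200, 400],
                   "Employed": [900, 4000, 15000, 31000]})

# DataFrame.plot returns the matplotlib Axes it drew on.
ax = df.plot(x="Sample_size", y="Employed", kind="scatter", figsize=(6, 4))

# Axes methods customize the plot after it has been created.
ax.set_title("Employed vs. Sample_size")
ax.set_xlabel("Sample size (unweighted)")
ax.set_ylim(0, 35000)

# The parent Figure is reachable from the Axes, for example to save the plot.
fig = ax.get_figure()
print(type(ax).__name__, type(fig).__name__)
```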


In [None]:
# let's see the above mentioned plot
recent_grads.plot(x='Sample_size', y='Employed', kind='scatter', title='Employed vs Sample_size', figsize=(5,10))

#### TODO:
* Generate scatter plots in separate jupyter notebook cells to explore the following relations:
  * Sample_size and Median
  * Sample_size and Unemployment_rate
  * Full_time and Median
  * ShareWomen and Unemployment_rate
  * Men and Median
  * Women and Median
* Use the plots to explore the following questions:
  * Do students in more popular majors make more money?
  * Do students that majored in subjects that were majority female make more money?
  * Is there any link between the number of full-time employees and median salary?

In [None]:
recent_grads.head(1)

In [None]:
# print name of all columns
recent_grads.columns

In [None]:
recent_grads.Major_category.unique()

In [None]:
# import pandas as pd
# pd.set_option('display.max_rows', 500)
# pd.set_option('display.max_columns', 500)
# pd.set_option('display.width', 1000)

In [None]:
a = recent_grads.groupby(['Major_category','Major','Rank'])[['Median']].max().sort_values(by='Median', ascending=False)
a

In [None]:
a = a.reset_index(drop=False)
a.head(20)

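- **This table ranks majors by median salary. It does not include `Total`, so popularity has to be checked separately (for example, with a scatter plot of `Total` against `Median`) before concluding that students in popular majors make more money.**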

#### Do students that majored in subjects that were majority female make more money?

In [None]:
# take a copy so that new columns can be added without a SettingWithCopyWarning
new_recent_grads = recent_grads[['Rank','Major','Total','Men','Women','Major_category','Median']].copy()
new_recent_grads.head(10)

In [None]:
new_recent_grads['Men_Percentage'] = (new_recent_grads['Men']/recent_grads['Total'])*100
new_recent_grads['Women_Percentage'] = (new_recent_grads['Women']/recent_grads['Total'])*100

new_recent_grads.head(10)

In [None]:
new_recent_grads['Men_More_Per_Than_Women']=new_recent_grads.Men_Percentage > new_recent_grads.Women_Percentage
new_recent_grads.head(10)

- **Among the top-ranked majors shown here, students who majored in subjects that were majority male make more money.**
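
Looking at the first ten rows covers only the highest-ranked majors. One way to compare the two groups over every row is sketched below, assuming a frame with `ShareWomen` and `Median` columns. The `majors` data is invented for illustration and is not taken from `recent_grads`.

```python
import pandas as pd

# Hypothetical majors; the figures are invented for illustration only.
majors = pd.DataFrame({
    "Major": ["M1", "M2", "M3", "M4", "M5", "M6"],
    "ShareWomen": [0.12, 0.35, 0.48, 0.61, 0.77, 0.90],
    "Median": [70000, 52000, 45000, 40000, 36000, 33000],
})

# Flag each major, then compare the typical salary of the two groups.
majors["Majority_female"] = majors["ShareWomen"] > 0.5
median_by_group = majors.groupby("Majority_female")["Median"].median()

# A single number summarizing the linear relationship across all majors.
share_vs_median = majors["ShareWomen"].corr(majors["Median"])

print(median_by_group)
print(round(share_vs_median, 2))
```

Running the same two steps on `recent_grads` gives a summary that covers every major and not only the ones visible in `head(10)`.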

#### Is there any link between the number of full-time employees and median salary?

In [None]:
# First Method
recent_grads[['Full_time','Median']].corr()

In [None]:
# Second Method
comp_recent_grads = recent_grads[['Full_time','Median']].sort_values('Full_time', ascending=False)
comp_recent_grads.head(10)

- **Here we can see that there is no link between Full_time and Median.**

In [None]:
# Generate scatter plots in separate jupyter notebook cells to explore the following relations:

#     Sample_size and Median
#     Sample_size and Unemployment_rate
#     Full_time and Median
#     ShareWomen and Unemployment_rate
#     Men and Median
#     Women and Median


In [None]:
# First Method
recent_grads[['Sample_size','Median']].corr()

In [None]:
# Second Method
recent_grads.plot(x='Sample_size', y='Median', kind='scatter', title="Sample_size vs Median", figsize=(5,10))

**From the plot, we can observe that there is no clear relation between `Sample_size` and `Median`.**

In [None]:
# First Method
recent_grads[['Sample_size','Unemployment_rate']].corr()

In [None]:
# Second Method
recent_grads.plot(x='Sample_size',y='Unemployment_rate', kind='scatter', figsize=(10,5))

**From the plot, we can observe that there is no clear relation between `Sample_size` and `Unemployment_rate`.**

In [None]:
# First Method
recent_grads[['Full_time','Median']].corr()

In [None]:
# Second Method
recent_grads.plot(x='Full_time',y='Median', kind='scatter', figsize=(10,5))

**Here from the picture, we can observe that there is no relation between Full_time and Median**

In [None]:
# First Method
recent_grads[['ShareWomen','Unemployment_rate']].corr()

In [None]:
# Second Method
recent_grads.plot(x='ShareWomen',y='Unemployment_rate', kind='scatter', figsize=(10,5))

**From the plot, we can observe that there is no clear relation between `ShareWomen` and `Unemployment_rate`.**

In [None]:
# First Method
recent_grads[['Men','Median']].corr()

In [None]:
# Second Method
recent_grads.plot(x='Men',y='Median', kind='scatter', figsize=(10,5))

In [None]:
# First Method
recent_grads[['Women','Median']].corr()

In [None]:
# Second Method
recent_grads.plot(x='Women',y='Median', kind='scatter', figsize=(10,5))

### Correlation Practice Examples
<img src="https://cdn1.byjus.com/wp-content/uploads/2021/03/Correlation.png" height=500px width=500px>

<img src="https://cdn.scribbr.com/wp-content/uploads/2021/08/01-correlation-types-1024x415.png" height=500px width=500px>
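
The three situations in the figures above (positive, negative, and no correlation) can be reproduced with a few hand-picked numbers. The series below are hypothetical and chosen so that each case is easy to recognize; the last lines recompute one coefficient from its definition.

```python
import pandas as pd

# Hypothetical series chosen so the three cases are easy to recognize.
x = pd.Series([1, 2, 3, 4, 5])
rises_with_x = 2 * x + 1                  # perfect positive linear relationship
falls_with_x = 10 - 3 * x                 # perfect negative linear relationship
unrelated = pd.Series([2, 5, 1, 5, 2])    # no linear trend against x

# Series.corr computes the Pearson coefficient by default (range -1 to 1).
print(x.corr(rises_with_x))
print(x.corr(falls_with_x))
print(x.corr(unrelated))

# The same coefficient written out from its definition:
# covariance(x, y) / (std(x) * std(y))
manual = x.cov(rises_with_x) / (x.std() * rises_with_x.std())
print(manual)
```

A coefficient near zero only rules out a linear relationship; a scatter plot is still worth checking for curved patterns.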

### 3. Pandas, Histograms
To explore the distribution of values in a column, we can select it from the DataFrame, call `Series.plot()`, and set the `kind` parameter to `hist`:

`recent_grads['Sample_size'].plot(kind='hist')`

#### TODO:
* Generate histograms in separate jupyter notebook cells to explore the distributions of the following columns:
  * Sample_size
  * Median
  * Employed
  * Full_time
  * ShareWomen
  * Unemployment_rate
  * Men
  * Women
* We encourage you to experiment with different bin sizes and ranges when generating these histograms.
* Use the plots to explore the following questions:
  * What percent of majors are predominantly male? Predominantly female?
  * What's the most common median salary range?

In [None]:
# Let's see the answers of questions

In [None]:
new_recent_grads.head(5)

In [None]:
# Majors where women are predominant
new_recent_grads[new_recent_grads.Women_Percentage > new_recent_grads.Men_Percentage].head(10)

In [None]:
# What's the most common median salary range?
new_recent_grads.Median.describe()

In [None]:
# Histogram of Sample_size
recent_grads.Sample_size.plot(kind='hist')

In [None]:
recent_grads.Sample_size.plot(kind='hist', bins=50)

In [None]:
# Histogram of Median
recent_grads.Median.plot(kind='hist')

In [None]:
recent_grads.Median.plot(kind='hist', bins=30)

In [None]:
recent_grads.Median.plot(kind='kde')

In [None]:
# Columns still to plot:
# Employed
# Full_time
# ShareWomen
# Unemployment_rate
# Men
# Women

In [None]:
# Histogram of Employed
recent_grads.Employed.plot(kind='hist')

In [None]:
recent_grads.Employed.plot(kind='hist', bins=30)

In [None]:
# Histogram of Full_time
recent_grads.Full_time.plot(kind='hist')

In [None]:
recent_grads.Full_time.plot(kind='hist', bins=30)

In [None]:
# Histogram of ShareWomen
recent_grads.ShareWomen.plot(kind='hist')

In [None]:
recent_grads.ShareWomen.plot(kind='hist', bins=40)

In [None]:
# Histogram of Unemployment_rate
recent_grads.Unemployment_rate.plot(kind='hist')

In [None]:
recent_grads.Unemployment_rate.plot(kind='hist', bins=30)

In [None]:
# Histogram of men
recent_grads.Men.plot(kind='hist')

In [None]:
recent_grads.Men.plot(kind='hist', bins=40)

In [None]:
# Histogram of women
recent_grads.Women.plot(kind='hist')

In [None]:
recent_grads.Women.plot(kind='hist', bins=40)

### 4. Pandas, Scatter Matrix Plot
* In the last 2 steps, we created individual scatter plots to visualize potential relationships between columns and histograms to visualize the distributions of individual columns. **A scatter matrix plot combines both scatter plots and histograms into one grid of plots and allows us to explore potential relationships and distributions simultaneously**

Because scatter matrix plots are frequently used in exploratory data analysis, pandas contains a function named `scatter_matrix()` that generates the plots for us. This function is part of the `pandas.plotting` module and needs to be imported separately.

#### TODO:
* Import scatter_matrix from the pandas.plotting module.
* Create a 2 by 2 scatter matrix plot using the Sample_size and Median columns.
* Create a 3 by 3 scatter matrix plot using the Sample_size, Median, and Unemployment_rate columns.

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
# Create a 2 by 2 scatter matrix plot using the Sample_size and Median columns.
matrix = recent_grads[['Sample_size','Median']]
scatter_matrix(matrix, figsize=(7,7))

In [None]:
# Create a 3 by 3 scatter matrix plot using the Sample_size, Median, and Unemployment_rate columns.
matrix = recent_grads[['Sample_size','Median','Unemployment_rate']]
scatter_matrix(matrix, figsize=(9,9))

### 5. Pandas, Bar Plots
* If we instead use the `DataFrame.plot.bar()` method, we can use the `x` parameter to specify the labels and the `y` parameter to specify the data for the bars.
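
A minimal sketch of the `x` and `y` parameters follows. The three majors and their values are hypothetical, and the `Agg` backend line is an assumption that lets the code run outside Jupyter; omit it inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import pandas as pd

# Hypothetical majors and shares, for illustration only.
df = pd.DataFrame({
    "Major": ["Major A", "Major B", "Major C"],
    "ShareWomen": [0.12, 0.54, 0.81],
})

# x supplies the label under each bar, y supplies the bar heights.
ax = df.plot.bar(x="Major", y="ShareWomen", legend=False, figsize=(6, 4))
ax.set_ylabel("Share of women")

# barh draws the same data horizontally, which suits long major names.
ax_h = df.plot.barh(x="Major", y="ShareWomen", legend=False, figsize=(6, 4))

# Each bar is a matplotlib Rectangle stored in ax.patches.
heights = [patch.get_height() for patch in ax.patches]
print(heights)
```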

#### TODO:

* Use bar plots to compare the percentages of women  from the first ten rows and last ten rows of the recent_grads dataframe along `Major`.
* Use bar plots to compare the unemployment rate (Unemployment_rate) from the first ten rows and last ten rows of the recent_grads dataframe along `Major`.

In [None]:
# stack the first ten and last ten rows; concat is the idiomatic way to combine rows
new_df = pd.concat([recent_grads.head(10), recent_grads.tail(10)])
new_df.sample(3)

In [None]:
# Use bar plots to compare the percentages of women (ShareWomen) from the first ten rows and
# last ten rows of the recent_grads dataframe along Major
new_df.plot.barh(x='Major', y='ShareWomen', figsize=(8,8))

In [None]:
# Use bar plots to compare the unemployment rate (Unemployment_rate) from the first ten rows and 
# last ten rows of the recent_grads dataframe along Major.
new_df.plot.bar(x='Major',y='Unemployment_rate', figsize=(8,8))

In [None]:
new_df.plot.barh(x='Major',y='Unemployment_rate', figsize=(8,8))