---   

<h1 align="center">Introduction to Data Analyst and Data Science for beginners</h1>
<h1 align="center">Lecture no 2.09-01(Pandas-00)</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

# _Python_Pandas_Introduction.ipynb_

<img align="center" width="700" height="700"  src="images/pandas-apps.png"  >

>-  **A Pandas DataFrame is a 2-dimensional labeled data structure (like a SQL table) with heterogeneously typed columns, having both a row and a column index.**
>-  **In short, Pandas is a software library written for the Python programming language, and its job is `data analysis and manipulation.`**

## So, what is Pandas and how is it used in AI?

Artificial Intelligence is about executing machine learning algorithms on products that we use every day. Any ML algorithm, for it to be effective, needs the following prerequisite steps to be done.
- `Data Collection` – Conducting opinion Surveys, scraping the internet, etc.
- `Data Handling` – Viewing data as a table, performing cleaning activities like checking for spellings, removal of blanks and wrong cases, removal of invalid values from data, etc.
- `Data Visualization` – plotting appealing graphs, so anyone who looks at the data can know what story the data tells us.
- `Pandas` – short for `Panel Data` (A panel is a 3D container of data) – is a Python library which contains built-in functions to clean, transform, manipulate, visualize and analyze data.

## Key Features of Pandas
<img src="images/Python-Pandas-Features.webp" height=600px width=600px>


- It has a fast and efficient DataFrame object with default and customized indexing.
- It is used for reshaping and pivoting data sets.
- It can group data for aggregations and transformations.
- It handles data alignment and integrated handling of missing data.
- It provides time series functionality.
- It processes a variety of data sets in different formats, such as matrix data, heterogeneous tabular data, and time series.
- It handles multiple operations on data sets, such as subsetting, slicing, filtering, groupBy, re-ordering, and re-shaping.
- It integrates with other libraries such as SciPy and scikit-learn.
- It provides fast performance, and if you want to speed it up even more, you can use Cython.

## Data Types
A data type is used by a programming language to understand how to store and manipulate data.
- `int` : Integer number, eg: 10, 12
- `float` : Floating point number, eg: 100.2, 3.1415
- `bool` : True/False value
- `object` : Text, non-numeric, or a combination of text and non-numeric values, eg: Apple
- `DateTime` : Date and time values
- `category` : A finite list of values
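
As a rough illustration of how these types show up in practice, the sketch below builds a small DataFrame and inspects its `dtypes`. The column names and values are made up for this example, and the exact dtype names printed (for instance `object` for text, or `datetime64[ns]` for dates) can differ between pandas versions.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
sample = pd.DataFrame({
    'age': [21, 18, 16],                                  # int
    'marks': [91.5, 93.0, 80.0],                          # float
    'passed': [True, True, False],                        # bool
    'name': ['Ehtisham', 'Ali', 'Ayesha'],                # text
    'joined': pd.to_datetime(['2021-01-05', '2021-02-10', '2021-03-15']),  # date and time
    'city': pd.Series(['Lahore', 'Okara', 'Okara'], dtype='category'),     # category
})

# One dtype per column
print(sample.dtypes)
```

Each column reports a single dtype, which is what allows pandas to store and process a column efficiently.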

## What does Pandas deal with?
There are two major categories of data that you can come across while doing data analysis.
- One dimensional data
- Two-dimensional data

These data can be of any data type: a character, a number, or even an object.

> **Series in Pandas is one-dimensional data, and data frames are 2-dimensional data. A series can hold only a single data type, whereas a data frame is meant to contain more than one data type.**

![](images/dataframe.webp)

**In the example shown above, `Name` is a `series` of the datatype `Object`, and it is treated as a character array. `Age` is another series, of the type `Integer`. `Marks` is the third series, and it is of the type `Integer` again. The individual Series are one-dimensional and hold only one data type. However, the `dataframe` as a whole is two-dimensional and is `heterogeneous` in nature.**
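
The difference between the two structures can be checked directly. This is a minimal sketch; the variable names `age` and `students` are chosen for illustration only.

```python
import pandas as pd

# A Series is one-dimensional and has a single dtype
age = pd.Series([21, 18, 16, 6])
print(age.ndim, age.dtype)

# A DataFrame is two-dimensional; each column keeps its own dtype
students = pd.DataFrame({
    'Name': ['Ehtisham', 'Ali', 'Ayesha', 'Dua'],
    'Age': [21, 18, 16, 6],
    'Marks': [91.5, 93, 80, 65],
})
print(students.ndim)
print(students.dtypes)

# Selecting a single column gives back a Series
print(type(students['Age']))
```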

## Creating Series & data frames in python

#### Creating a simple Series

In [6]:

#importing pandas library
import pandas as pd
 
#Creating a list
name = ['Ehtisham', 'Ali', 'Ayesha', 'Dua']

#Creating a Series by passing list variable to Series() function of pandas 
name_series = pd.Series(name)

#Printing Series
print(name_series)

0    Ehtisham
1         Ali
2      Ayesha
3         Dua
dtype: object


In [9]:
# Let’s check type of Series
print("Type of name_Series is : ",type(name_series))

Type of name_Series is :  <class 'pandas.core.series.Series'>


#### Creating multiple series

In [12]:
name = ['Ehtisham', 'Ali', 'Ayesha', 'Dua']
marks = [91.5,93,80,65]
age = [21,18,16,6]

#Creating a Series by passing list variable to Series() function of pandas 
name_ser = pd.Series(name)
marks_ser = pd.Series(marks)
age_ser = pd.Series(age)

#Printing Series
print("Name Series : ", name_ser, sep="\n")
print("Marks Series : ", marks_ser, sep="\n")
print("Age Series : ", age_ser, sep="\n")

Name Series : 
0    Ehtisham
1         Ali
2      Ayesha
3         Dua
dtype: object
Marks Series : 
0    91.5
1    93.0
2    80.0
3    65.0
dtype: float64
Age Series : 
0    21
1    18
2    16
3     6
dtype: int64


#### Creating Dataframe from multiple Series 

In [13]:
#Creating a Series by passing list variable to Series() function of pandas 
name_ser = pd.Series(name)
marks_ser = pd.Series(marks)
age_ser = pd.Series(age)

# Creating a Dictionary by passing series as values of dictionary
dic = {'Name':name_ser,
      'Marks':marks_ser,
      'Age':age_ser
      }

# Create dataframe by passing dictionary to pd.DataFrame function of pandas
df = pd.DataFrame(dic)
print("Printing of DataFrame .... ")
df

Printing of DataFrame .... 


Unnamed: 0,Name,Marks,Age
0,Ehtisham,91.5,21
1,Ali,93.0,18
2,Ayesha,80.0,16
3,Dua,65.0,6


#### How to add new column to the dataframe

In [14]:
address = pd.Series(['Lahore','Okara','Okara','Okara'])
# Creating a new column in the dataframe by providing a Series created using a list
df['Address'] = address
print("Printing of DataFrame .... ")
df

Printing of DataFrame .... 


Unnamed: 0,Name,Marks,Age,Address
0,Ehtisham,91.5,21,Lahore
1,Ali,93.0,18,Okara
2,Ayesha,80.0,16,Okara
3,Dua,65.0,6,Okara
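
One point worth knowing when adding a column from a Series: pandas matches values by index label, not by position. The sketch below uses a hypothetical `df_demo` frame to show what happens when the Series covers only some of the rows.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Ehtisham', 'Ali', 'Ayesha', 'Dua']})

# A Series whose index only partly overlaps the DataFrame's index (0..3)
partial = pd.Series(['Lahore', 'Okara'], index=[0, 2])
df_demo['Address'] = partial      # rows 1 and 3 have no match and become NaN

# A plain list is assigned by position and must match the row count
df_demo['Age'] = [21, 18, 16, 6]
print(df_demo)
```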


## All statistical functions
- `count()` : Returns the number of non-null values
- `sum()` : Returns the sum of all values
- `mean()` : Returns the average of all values
- `median()` : Returns the median of all values
- `mode()` : Returns the mode (the most frequent value or values)
- `std()` : Returns the standard deviation
- `min()` : Returns the minimum of all values
- `max()` : Returns the maximum of all values
- `abs()` : Returns the absolute value of each element

In [17]:
print("Total number of elements in each column of dataframe ")
df.count()

Total number of elements in each column of dataframe 


Name       4
Marks      4
Age        4
Address    4
dtype: int64
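
The other functions in the list work the same way. Here is a short sketch on a standalone Series; the name `marks` and the two small helper Series are only for illustration.

```python
import pandas as pd

marks = pd.Series([91.5, 93, 80, 65])

print("count :", marks.count())
print("sum   :", marks.sum())
print("mean  :", marks.mean())
print("median:", marks.median())
print("min   :", marks.min())
print("max   :", marks.max())
print("std   :", marks.std())     # sample standard deviation (ddof=1 by default)

# mode() returns a Series because there can be more than one mode
modes = pd.Series([1, 2, 2, 3]).mode().tolist()
print("mode  :", modes)

# abs() works element by element
absolute = pd.Series([-1.5, 2, -3]).abs().tolist()
print("abs   :", absolute)
```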

## Input and Output

- Often, you won’t be creating data but will be having it in some form, and you would want to import it to run your analysis on it. Fortunately, Pandas allows you to do this. Not only does it help in importing data, but you can also save your data in your desired format using Pandas.
- The table below shows the formats supported by Pandas, the function to read files using Pandas, and the function to write files.

|Input type       |Reader      |Writer    |
|-----------------|------------|----------|
|CSV              |read_csv    |to_csv    |
|JSON             |read_json   |to_json   |
|HTML             |read_html   |to_html   |
|Excel            |read_excel  |to_excel  |
|SAS              |read_sas    |–         |
|Python Pickle    |read_pickle |to_pickle |
|SQL              |read_sql    |to_sql    |
|Google Big Query |read_gbq    |to_gbq    |

In [38]:
#Read input file
df = pd.read_csv('datasets/psl.csv')
df.head()

Unnamed: 0,psl_year,match_number,team_1,team_2,inning,over,ball,runs,total_runs,wickets,is_four,is_six,is_wicket,wicket,wicket_text,result
0,2016,1,Islamabad United,Quetta Gladiators,1,1,1,0,0,0,False,False,False,,,Gladiators
1,2016,1,Islamabad United,Quetta Gladiators,1,1,2,0,0,0,False,False,False,,,Gladiators
2,2016,1,Islamabad United,Quetta Gladiators,1,1,3,0,0,0,False,False,False,,,Gladiators
3,2016,1,Islamabad United,Quetta Gladiators,1,1,4,0,0,0,False,False,False,,,Gladiators
4,2016,1,Islamabad United,Quetta Gladiators,1,1,5,0,0,0,False,False,False,,,Gladiators


In [42]:
# Save a dataframe to CSV File
data = {'Name':['Captain America', 'Iron Man', 'Hulk', 'Thor','Black Panther'],
        'Rating':[100, 80, 84, 93, 90],
        'Place':['USA','USA','USA','Asgard','Wakanda']}
# Create dataframe from above dictionary
df = pd.DataFrame(data)
df.to_csv("datasets/avengers.csv")

In [44]:
# !ls datasets/
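
A small sketch of the write-then-read round trip, using an in-memory `io.StringIO` buffer instead of a file on disk so that it runs anywhere. The `heroes` frame is a shortened, hypothetical version of the data above. It also shows one detail of `to_csv`: by default the row index is written as an extra unnamed column, which comes back as `Unnamed: 0` unless `index=False` is passed.

```python
import io
import pandas as pd

heroes = pd.DataFrame({'Name': ['Hulk', 'Thor'], 'Rating': [84, 93]})

# Default: the row index is written as an extra, unnamed first column
with_index = heroes.to_csv()
# index=False leaves the row index out of the output
without_index = heroes.to_csv(index=False)

# Read the text back (StringIO stands in for a file on disk)
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_with_index)
print(cols_without_index)
```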

## Aggregation
- An aggregation function can be applied to one or more columns. You can either apply the same aggregate function across various columns or different aggregate functions across various columns.
- Syntax : 
 >- `DataFrame.aggregate(func, axis=0, *args, **kwargs)`
 
<img src="images/pandas-agg-func.png" height=400px width=600px>

In [56]:
data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
gapminder_data = gapminder[['continent','pop']]
gapminder_data.head()
# gapminder.head()

Unnamed: 0,continent,pop
0,Asia,8425333.0
1,Asia,9240934.0
2,Asia,10267083.0
3,Asia,11537966.0
4,Asia,13079460.0


In [51]:
# Using Aggregate Functions on Series
mean  = gapminder_data['pop'].aggregate('mean')
print("Mean of population : ", mean)

Min  = gapminder_data['pop'].aggregate('min')
print("Minimum value of population : ", Min)

Max  = gapminder_data['pop'].aggregate('max')
print("Maximum value of population : ", Max)

Std  = gapminder_data['pop'].aggregate('std')
print("Std of population : ", Std)

Mean of population :  29601212.32511736
Minimum value of population :  60011.0
Maximum value of population :  1318683096.0
Std of population :  106157896.74682792


In [54]:
# Using multiple Aggregate Functions on a Series (a single DataFrame column)
gapminder_data['pop'].aggregate(['sum','min','max'])

sum    5.044047e+10
min    6.001100e+04
max    1.318683e+09
Name: pop, dtype: float64

In [57]:
# Using multiple Aggregate Functions on Multiple columns of Dataframe
gapminder[['pop','lifeExp']].aggregate(['sum','min','max'])

Unnamed: 0,pop,lifeExp
sum,50440470000.0,101344.44468
min,60011.0,23.599
max,1318683000.0,82.603


In [59]:
# We can also perform above task by using below code
gapminder.aggregate({'pop':['sum','min','max'],
                    'lifeExp':['sum','min','max']})

Unnamed: 0,pop,lifeExp
sum,50440470000.0,101344.44468
min,60011.0,23.599
max,1318683000.0,82.603


In [63]:
# df.describe()  gives overall descriptive view of our dataset
gapminder.describe()

Unnamed: 0,year,pop,lifeExp,gdpPercap
count,1704.0,1704.0,1704.0,1704.0
mean,1979.5,29601210.0,59.474439,7215.327081
std,17.26533,106157900.0,12.917107,9857.454543
min,1952.0,60011.0,23.599,241.165876
25%,1965.75,2793664.0,48.198,1202.060309
50%,1979.5,7023596.0,60.7125,3531.846988
75%,1993.25,19585220.0,70.8455,9325.462346
max,2007.0,1318683000.0,82.603,113523.1329


## Groupby
- Pandas groupby function is used to split the DataFrame into groups based on some criteria. 
- Similar to the `SQL GROUP BY` clause, the pandas `DataFrame.groupby()` function is used to collect identical data into groups and perform aggregate functions on the grouped data. A group by operation involves splitting the data, applying some functions, and finally combining the results.
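
The split, apply, and combine steps can be written out by hand to see what `groupby()` does in a single call. This is only a sketch on made-up data (`sales`, `team`, and `points` are hypothetical names), not how pandas implements it internally.

```python
import pandas as pd

sales = pd.DataFrame({
    'team': ['A', 'B', 'A', 'B', 'A'],
    'points': [10, 20, 30, 40, 50],
})

# 1. Split: one sub-DataFrame per distinct key
pieces = {key: group for key, group in sales.groupby('team')}

# 2. Apply: compute something for each piece
totals = {key: group['points'].sum() for key, group in pieces.items()}

# 3. Combine: collect the per-group results into one object
manual = pd.Series(totals, name='points')

# groupby performs all three steps in one expression
auto = sales.groupby('team')['points'].sum()
print(manual.to_dict(), auto.to_dict())
```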

<img src="images/pandas-groupby-standard-dev.png.webp" height=500px width=500px>
<img src="images/groupby-example.png" height=700px width=700px>

### Syntax of Pandas DataFrame.groupby()

       
       `DataFrame.groupby(by=None, axis=0, level=None, as_index=True,     
       sort=True, group_keys=True, squeeze=<no_default>,      
       observed=False, dropna=True)`
       
       
- `by` – List of column names to group by
- `axis` – Defaults to 0. It takes 0 or ‘index’, 1 or ‘columns’
- `level` – Used with MultiIndex.
- `as_index` – Defaults to True, which makes the group labels the index of the result. Use False for SQL-style grouped output.
- `sort` – Defaults to True. Specifies whether to sort the group keys
- `group_keys` – Add group keys to the index or not
- `squeeze` – Deprecated in newer versions
- `observed` – This only applies if any of the groupers are Categoricals.
- `dropna` – Defaults to True, which drops rows whose group key is None/NaN. Use False to keep them as a separate group.
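
The effect of `as_index`, `dropna`, and `sort` is easiest to see side by side. The sketch below uses a small made-up `courses` frame with one missing key; `dropna` assumes pandas 1.1 or later.

```python
import pandas as pd

courses = pd.DataFrame({
    'Courses': ['Spark', 'Hadoop', None, 'Spark'],
    'Fee': [22000, 23000, 1500, 25000],
})

# Default: group keys become the index, rows with a missing key are dropped
default = courses.groupby('Courses')['Fee'].sum()

# as_index=False keeps the key as an ordinary column (SQL-style output)
flat = courses.groupby('Courses', as_index=False)['Fee'].sum()

# dropna=False keeps the missing key as its own group
with_na = courses.groupby('Courses', dropna=False)['Fee'].sum()

# sort=False keeps groups in order of first appearance
unsorted = courses.groupby('Courses', sort=False)['Fee'].sum()

print(default, flat, with_na, unsorted, sep="\n\n")
```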

In [77]:
technologies   = ({
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas","Hadoop","Spark","Python","NA"],
    'Fee' :[22000,25000,23000,24000,26000,25000,25000,22000,1500],
    'Duration':['30days','50days','55days','40days','60days','35days','30days','50days','40days'],
    'Discount':[1000,2300,1000,1200,2500,None,1400,1600,0]
          })
df = pd.DataFrame(technologies)
print(df)

   Courses    Fee Duration  Discount
0    Spark  22000   30days    1000.0
1  PySpark  25000   50days    2300.0
2   Hadoop  23000   55days    1000.0
3   Python  24000   40days    1200.0
4   Pandas  26000   60days    2500.0
5   Hadoop  25000   35days       NaN
6    Spark  25000   30days    1400.0
7   Python  22000   50days    1600.0
8       NA   1500   40days       0.0


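#### Use groupby() to compute the sum of Fee and Discount for each course

In [78]:
# Select the numeric columns explicitly so the text column Duration is left out
df.groupby(['Courses'])[['Fee', 'Discount']].sum()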

Unnamed: 0_level_0,Fee,Discount
Courses,Unnamed: 1_level_1,Unnamed: 2_level_1
Hadoop,48000,1000.0
NA,1500,0.0
Pandas,26000,2500.0
PySpark,25000,2300.0
Python,46000,2800.0
Spark,47000,2400.0


In [80]:
# similarly
df.groupby(['Courses'])[['Fee', 'Discount']].aggregate('sum')

Unnamed: 0_level_0,Fee,Discount
Courses,Unnamed: 1_level_1,Unnamed: 2_level_1
Hadoop,48000,1000.0
NA,1500,0.0
Pandas,26000,2500.0
PySpark,25000,2300.0
Python,46000,2800.0
Spark,47000,2400.0


#### pandas groupby() on Two or More Columns like Courses and Duration

In [81]:
df.groupby(['Courses','Duration']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Fee,Discount
Courses,Duration,Unnamed: 2_level_1,Unnamed: 3_level_1
Hadoop,35days,25000.0,
Hadoop,55days,23000.0,1000.0
NA,40days,1500.0,0.0
Pandas,60days,26000.0,2500.0
PySpark,50days,25000.0,2300.0
Python,40days,24000.0,1200.0
Python,50days,22000.0,1600.0
Spark,30days,23500.0,1200.0


#### Add Index to the grouped data
- By default, `groupby()` uses the group keys as the index of the result. To turn them back into regular columns with a default integer row index, use the `DataFrame.reset_index()` method.

In [82]:
df.groupby(['Courses','Duration']).mean().reset_index()

Unnamed: 0,Courses,Duration,Fee,Discount
0,Hadoop,35days,25000.0,
1,Hadoop,55days,23000.0,1000.0
2,NA,40days,1500.0,0.0
3,Pandas,60days,26000.0,2500.0
4,PySpark,50days,25000.0,2300.0
5,Python,40days,24000.0,1200.0
6,Python,50days,22000.0,1600.0
7,Spark,30days,23500.0,1200.0


#### Remove sorting on grouped results by using `sort` parameter of df.groupby()

In [91]:
df2=df.groupby(by=['Courses'], sort=False)[['Fee', 'Discount']].sum()
df2

Unnamed: 0_level_0,Fee,Discount
Courses,Unnamed: 1_level_1,Unnamed: 2_level_1
Spark,47000,2400.0
PySpark,25000,2300.0
Hadoop,48000,1000.0
Python,46000,2800.0
Pandas,26000,2500.0
NA,1500,0.0


#### Apply More Aggregations
- You can also compute several aggregations at the same time in pandas by passing a list of aggregation functions to `aggregate()`.

#### Compute minimum and maximum fee of each course

In [92]:
df.groupby('Courses')['Fee'].aggregate(['min','max'])

Unnamed: 0_level_0,min,max
Courses,Unnamed: 1_level_1,Unnamed: 2_level_1
Hadoop,23000,25000
NA,1500,1500
Pandas,26000,26000
PySpark,25000,25000
Python,22000,24000
Spark,22000,25000


In [93]:
# Groupby multiple columns & multiple aggregations
df.groupby('Courses').aggregate({'Duration':'count',
                                'Fee':['min','max']})

Unnamed: 0_level_0,Duration,Fee,Fee
Unnamed: 0_level_1,count,min,max
Courses,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Hadoop,2,23000,25000
NA,1,1500,1500
Pandas,1,26000,26000
PySpark,1,25000,25000
Python,2,22000,24000
Spark,2,22000,25000
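
The dictionary form above produces a two-level column header. One alternative, not used in the original notebook, is named aggregation, which gives each result column a flat name of your choice. The frame `df_fees` and the output names `n_rows`, `min_fee`, and `max_fee` are hypothetical.

```python
import pandas as pd

df_fees = pd.DataFrame({
    'Courses': ['Spark', 'Hadoop', 'Spark', 'Hadoop'],
    'Fee': [22000, 23000, 25000, 25000],
    'Duration': ['30days', '55days', '30days', '35days'],
})

# Named aggregation: new_column=(source_column, function)
summary = df_fees.groupby('Courses').agg(
    n_rows=('Duration', 'count'),
    min_fee=('Fee', 'min'),
    max_fee=('Fee', 'max'),
)
print(summary)
```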


## Practice Questions

### Regiment
- A regiment is a military unit. Its role and size vary markedly, depending on the country, service and/or specialisation.



#### Step 1. Import the necessary libraries


#### Step 2. Create the DataFrame with the following values and Assign it to a variable called regiment.

In [110]:
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
regiment = pd.DataFrame(raw_data)
# regiment

#### Step 3. What is the mean `preTestScore` from the regiment `Nighthawks`(Nightbird/Night owl)?


In [115]:
# regiment[regiment['regiment'] == 'Nighthawks']['preTestScore'].mean()

#### Step 4. Present/show general statistics by `company` of regiment.

In [117]:
# regiment.groupby('company').describe()

#### Step 5. What is the mean of each company's preTestScore?

In [121]:
# regiment.groupby('company')['preTestScore'].mean()

# OR

# regiment.groupby('company').mean(numeric_only=True)

#### Step 6. Presents/shows the `mean` preTestScores grouped by regiment and company.

In [126]:
# regiment.groupby(['regiment','company'])['preTestScore'].mean()
# OR
# regiment.groupby(['regiment', 'company']).preTestScore.mean().unstack()

#### Step 7. Present/show the `mean` preTestScores grouped by regiment and company, using the reset_index method

In [127]:
# regiment.groupby(['regiment', 'company']).preTestScore.mean().reset_index()

#### Step 8. Group the entire dataframe by regiment and company, and apply the `sum` aggregate function.

In [130]:
# regiment.groupby(['regiment','company']).sum()

#### Step 9. What is the number of observations in each regiment and company?

In [133]:
# regiment.groupby(['regiment','company']).size()
# OR 
# regiment.groupby(['regiment','company']).count()

#### Step 10. Iterate over a group and print the name and the whole data from the regiment

In [137]:
# # Group the dataframe by regiment, and for each regiment,
# for name, group in regiment.groupby('regiment'):
#     # print the name of the regiment
#     print('Name : ',name)
# #     print data of that regiment
#     print(group)

## Merging, Joining and Concatenation
Before I start with the Pandas join and merge functions, let me introduce you to four different types of joins: inner join, left join, right join, and outer join.
<img src="images/Untitled.png" height=500px width=500px align="right"> 

- **Full outer join**: Combines results from both DataFrames. The result will have all rows from both DataFrames, with missing values where there is no match.
- **Inner join**: Only those rows which are present in both DataFrame A and DataFrame B will be present in the output.
- **Right join**: Right join uses all records from DataFrame B and matching records from DataFrame A.
- **Left join**: Left join uses all records from DataFrame A and matching records from DataFrame B.
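
To make the four definitions concrete, the sketch below merges two tiny made-up frames (`left_df` plays the role of DataFrame A, `right_df` of DataFrame B) with each join type and records which keys survive.

```python
import pandas as pd

left_df = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'a': [1, 2, 3]})
right_df = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'b': [10, 20, 30]})

keys_by_join = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(left_df, right_df, on='key', how=how)
    keys_by_join[how] = result['key'].tolist()
    print(how, '->', keys_by_join[how])
```

Only `K1` and `K2` exist in both frames, so the inner join keeps just those two, while the outer join keeps all four keys.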

<img src="images/joins.png" height=600px width=600px align="left" > 


### Merging
- Merging a Dataframe with one unique key.

#### Syntax:
```
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False)
``` 
- `left` − A DataFrame object.
- `right` − Another DataFrame object.
- `on` − Columns (names) to join on. Must be found in both the left and right DataFrame objects.
- `left_on` − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
- `right_on` − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
- `left_index` − If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
- `right_index` − Same usage as left_index for the right DataFrame.
- `how` − One of 'left', 'right', 'outer', 'inner'. Defaults to inner. Each method has been described below.
- `sort` − Sort the result DataFrame by the join keys in lexicographical order. Defaults to False; sorting costs extra time, so leave it off unless you need ordered output.
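
The `left_on`, `right_on`, and `right_index` parameters matter when the key is not stored under the same column name in both frames. A minimal sketch, assuming two hypothetical frames `employees` and `addresses`:

```python
import pandas as pd

employees = pd.DataFrame({'emp_id': ['K0', 'K1'], 'Name': ['Mercy', 'Prince']})
addresses = pd.DataFrame({'key': ['K0', 'K1'], 'Address': ['Canada', 'UK']})

# Key columns have different names: use left_on / right_on
by_columns = pd.merge(employees, addresses, left_on='emp_id', right_on='key')

# Key lives in the index of the right frame: use right_index=True
by_index = pd.merge(employees, addresses.set_index('key'),
                    left_on='emp_id', right_index=True)
print(by_columns)
print(by_index)
```

With `left_on`/`right_on`, both key columns are kept in the result; drop one of them afterwards if it is redundant.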

In [140]:
# Define a dictionary containing employee data 

data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
         'Age':[27, 24, 22, 32],} 
# Define a dictionary containing employee data 

data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'], 
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']} 

# Convert the dictionary into DataFrame  
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame  
df2 = pd.DataFrame(data2) 

# merging of two dataframes on the basis of `key` 
final_df = pd.merge(df1, df2, on='key')
final_df

Unnamed: 0,key,Name,Age,Address,Qualification
0,K0,Mercy,27,Canada,Btech
1,K1,Prince,24,UK,B.A
2,K2,John,22,India,MS
3,K3,Cena,32,USA,Phd


#### Merging Dataframe using multiple keys.

In [141]:
# Define a dictionary containing employee data 

data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
          'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32],} 
# Define a dictionary containing employee data 

data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'], 
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']} 

# Convert the dictionary into DataFrame  
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame  
df2 = pd.DataFrame(data2) 

# merging of two dataframes on the basis of `key` and `Address` 
final_df = pd.merge(df1, df2, on=['key','Address'])
final_df

Unnamed: 0,key,Name,Address,Age,Qualification
0,K0,Mercy,Canada,27,Btech
1,K2,John,India,22,MS


#### Left merge
- In pd.merge() I pass the argument `how='left'` to perform a left merge.

In [142]:
# Define a dictionary containing employee data 

data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
          'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32],} 
# Define a dictionary containing employee data 

data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'], 
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']} 

# Convert the dictionary into DataFrame  
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame  
df2 = pd.DataFrame(data2) 

# left merge of two dataframes on the basis of `key` and `Address` 
final_df = pd.merge(df1, df2, on=['key','Address'], how='left')
final_df

Unnamed: 0,key,Name,Address,Age,Qualification
0,K0,Mercy,Canada,27,Btech
1,K1,Prince,Australia,24,
2,K2,John,India,22,MS
3,K3,Cena,Japan,32,


#### Right merge
- In pd.merge() I pass the argument `how='right'` to perform a right merge.

In [143]:
# Define a dictionary containing employee data 

data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
          'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32],} 
# Define a dictionary containing employee data 

data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'], 
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']} 

# Convert the dictionary into DataFrame  
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame  
df2 = pd.DataFrame(data2) 

# right merge of two dataframes on the basis of `key` and `Address` 
final_df = pd.merge(df1, df2, on=['key','Address'], how='right')
final_df

Unnamed: 0,key,Name,Address,Age,Qualification
0,K0,Mercy,Canada,27.0,Btech
1,K1,,UK,,B.A
2,K2,John,India,22.0,MS
3,K3,,USA,,Phd


#### Outer Merge
- In pd.merge(), I pass the argument `how='outer'` to perform an outer merge.

In [144]:
# Define a dictionary containing employee data 

data1 = {'key':['K0','K1','K2','K3'],
         'Name':['Mercy', 'Prince', 'John', 'Cena'],
          'Address':['Canada', 'Australia', 'India', 'Japan'],
         'Age':[27, 24, 22, 32],} 
# Define a dictionary containing employee data 

data2 = {'key':['K0','K1','K2','K3'],
         'Address':['Canada', 'UK', 'India', 'USA'], 
         'Qualification':['Btech', 'B.A', 'MS', 'Phd']} 

# Convert the dictionary into DataFrame  
df1 = pd.DataFrame(data1)
# Convert the dictionary into DataFrame  
df2 = pd.DataFrame(data2) 

# outer merge of two dataframes on the basis of `key` and `Address` 
final_df = pd.merge(df1, df2, on=['key','Address'], how='outer')
final_df

Unnamed: 0,key,Name,Address,Age,Qualification
0,K0,Mercy,Canada,27.0,Btech
1,K1,Prince,Australia,24.0,
2,K2,John,India,22.0,MS
3,K3,Cena,Japan,32.0,
4,K1,,UK,,B.A
5,K3,,USA,,Phd


## Join
- Join is used to combine DataFrames having different index values.
- `I have two different tables in Python but I’m not sure how to join them. What criteria should I consider? What are the different ways I can join these tables?`
- Sound familiar? I have come across this question plenty of times on online discussion forums. Working with one table is fairly straightforward but things become challenging when we have data spread across two or more tables.
- This is where the concept of Joins comes in. I cannot emphasize enough how many times I have used these Joins in Pandas! They’ve come in especially handy during data science hackathons when I needed to quickly join multiple tables.
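
As a minimal sketch of what `DataFrame.join()` does, the two hypothetical frames below (`names` and `quals`) share only part of their index. `join()` matches rows on the index and performs a left join unless told otherwise.

```python
import pandas as pd

names = pd.DataFrame({'Name': ['Mercy', 'Prince', 'John']},
                     index=['K0', 'K1', 'K2'])
quals = pd.DataFrame({'Qualification': ['B.A', 'MS', 'Phd']},
                     index=['K1', 'K2', 'K3'])

# Default (left) join: keep every row of `names`, fill gaps with NaN
left_joined = names.join(quals)

# how='outer' keeps index labels from both frames
outer_joined = names.join(quals, how='outer')
print(left_joined)
print(outer_joined)
```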

#### Understanding the Problem Statement

- I’m sure you’re quite familiar with e-commerce sites like `Amazon` and `Flipkart` these days. We are bombarded by their advertisements when we’re visiting non-related websites – that’s the power of targeted marketing!
- We’ll take a simple problem from a related marketing brand here. We are given two tables – one which contains data about products and the other that has customer-level information.
- We will use these tables to understand how the different types of joins work using Pandas.

#### Note: 
 >- Our task is to use our joining skills and generate meaningful information from the data.

In [146]:
# The product dataframe contains product details like Product_ID, Product_name, Category, Price, and Seller_City. 
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
    'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
    'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
    'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
})

# The customer dataframe contains details like id, name, age, Product_ID, Purchased_Product, and City.
customer=pd.DataFrame({
    'id':[1,2,3,4,5,6,7,8,9],
    'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
    'age':[20,25,15,10,30,65,35,18,23],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
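
One possible first step with these tables is to combine them on the shared `Product_ID` column. The sketch below is an assumption about how the exercise might proceed, not the author's solution. It uses trimmed copies of the two tables (`product_small` and `customer_small` are hypothetical names) so that it runs on its own.

```python
import pandas as pd

# Trimmed copies of the two tables defined above
product_small = pd.DataFrame({
    'Product_ID': [101, 102, 103],
    'Product_name': ['Watch', 'Bag', 'Shoes'],
    'Price': [299.0, 1350.50, 2999.0],
})
customer_small = pd.DataFrame({
    'id': [1, 2, 5],
    'name': ['Olivia', 'Aditya', 'Dominic'],
    'Product_ID': [101, 0, 103],
})

# Inner join: only products that some customer actually bought
sold = pd.merge(product_small, customer_small, on='Product_ID')

# Left join from the product side: unsold products get NaN customer fields
all_products = pd.merge(product_small, customer_small, on='Product_ID', how='left')
print(sold[['Product_name', 'name']])
print(all_products[['Product_name', 'name']])
```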

In [101]:
gapminder.head()

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
0,Afghanistan,1952,8425333.0,Asia,28.801,779.445314
1,Afghanistan,1957,9240934.0,Asia,30.332,820.85303
2,Afghanistan,1962,10267083.0,Asia,31.997,853.10071
3,Afghanistan,1967,11537966.0,Asia,34.02,836.197138
4,Afghanistan,1972,13079460.0,Asia,36.088,739.981106


In [69]:
!ls datasets/

avengers.csv  gapminder.csv  psl.csv


In [76]:
gapminder[gapminder.country == 'Pakistan']

Unnamed: 0,country,year,pop,continent,lifeExp,gdpPercap
1164,Pakistan,1952,41346560.0,Asia,43.436,684.597144
1165,Pakistan,1957,46679944.0,Asia,45.557,747.083529
1166,Pakistan,1962,53100671.0,Asia,47.67,803.342742
1167,Pakistan,1967,60641899.0,Asia,49.8,942.408259
1168,Pakistan,1972,69325921.0,Asia,51.929,1049.938981
1169,Pakistan,1977,78152686.0,Asia,54.043,1175.921193
1170,Pakistan,1982,91462088.0,Asia,56.158,1443.429832
1171,Pakistan,1987,105186881.0,Asia,58.245,1704.686583
1172,Pakistan,1992,120065004.0,Asia,60.838,1971.829464
1173,Pakistan,1997,135564834.0,Asia,61.818,2049.350521


In [75]:
gapminder['country'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina',
       'Australia', 'Austria', 'Bahrain', 'Bangladesh', 'Belgium',
       'Benin', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon',
       'Canada', 'Central African Republic', 'Chad', 'Chile', 'China',
       'Colombia', 'Comoros', 'Congo Dem. Rep.', 'Congo Rep.',
       'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Ethiopia',
       'Finland', 'France', 'Gabon', 'Gambia', 'Germany', 'Ghana',
       'Greece', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Haiti',
       'Honduras', 'Hong Kong China', 'Hungary', 'Iceland', 'India',
       'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy',
       'Jamaica', 'Japan', 'Jordan', 'Kenya', 'Korea Dem. Rep.',
       'Korea Rep.', 'Kuwait', 'Lebanon',