<center>
<img src='https://drive.google.com/uc?id=1jx4705-9ZraacMfKQfLc7ZDSgaL4C9Sj' height='180px' />
</center>

<h3>Fill the Missing Values, Renaming & Combining </h3>

In [None]:
## Replacing missing values is a common operation.
## Pandas provides a handy method for this: fillna().
## fillna() supports several strategies for filling in missing data.

## Example 1: first understand how NaN values get into a dataset.
import pandas as pd

## Note : I have purposely made a df with some numbers and some improper values
df = pd.DataFrame({'values': ['700','ABC300','500','900XYZ']})

## Converting to numeric fails for the alphanumeric strings
df['values'] = pd.to_numeric(df['values'], errors='coerce')
## errors='coerce' replaces every value that cannot be parsed with NaN
## https://stackoverflow.com/questions/38304691/drop-rows-with-errors-for-pandas-data-coercion

print(df)
## such actions introduce NaN values in your dataset.

   values
0   700.0
1     NaN
2   500.0
3     NaN


In [None]:
## Replace the NaN values in a column with zero (or any other value)
df['values'] = df['values'].fillna(0)

print(df)

   values
0   700.0
1     0.0
2   500.0
3     0.0


**`Remember`** : to fill the NaN values in a single column, call fillna() on that column.
<pre> df['DataFrame Column'] = df['DataFrame Column'].fillna(0) </pre>

**In case** you plan to replace all NaNs in the **entire dataframe**, **apply fillna()** to the dataframe itself, or use **replace()**. <font color='darkgreen'>Refer to the example below:</font>

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'values_1': ['700','ABC300','500','900XYZ'],
                   'values_2': ['DDD200','150','350ZZZ','400'] 
                   })

df = df.apply (pd.to_numeric, errors='coerce')
print (df)

print("----------------------------------")

new_df = df.fillna(0)   
## or use this : 
## new_df = df.replace(np.nan,0)

print (new_df)

   values_1  values_2
0     700.0       NaN
1       NaN     150.0
2     500.0       NaN
3       NaN     400.0
----------------------------------
   values_1  values_2
0     700.0       0.0
1       0.0     150.0
2     500.0       0.0
3       0.0     400.0
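
Zero is not the only possible fill value. As a minimal sketch of two other strategies, the cell below fills each column with its own constant (by passing a dict) and then with its own mean. The `scores` frame and its column names are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration only.
scores = pd.DataFrame({"maths": [70.0, np.nan, 50.0, 90.0],
                       "physics": [np.nan, 15.0, np.nan, 45.0]})

# Strategy 1: a different constant per column, passed as a dict.
by_column = scores.fillna({"maths": 0, "physics": -1})

# Strategy 2: fill each column with its own mean (mean() skips NaN).
by_mean = scores.fillna(scores.mean())

print(by_column)
print(by_mean)
```

Which strategy is appropriate depends on what a missing value means in the data, so treat these as options to compare rather than defaults.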


<font color='red'><b>How to replace NaNs by preceding values in pandas DataFrame?</b></font>
<pre>
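<b>Hint use</b> : df.fillna(method='ffill'). Also <u>self try</u> df.fillna(method='bfill')
<b>Note</b> : recent pandas releases deprecate the method= argument of fillna(); df.ffill() and df.bfill() do the same job.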
</pre>

In [None]:
## creating a dummy dataframe

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

df.fillna(method='ffill')
## ffill means forward fill: each NaN takes the value from the row above
print(df)
## df is unchanged because fillna() returns a new DataFrame; pass inplace=True to modify df itself
print("-------------------")
df.fillna(method='ffill', inplace=True)
print(df)
print("-------------------")
df.fillna(method='bfill', inplace=True)  ## try this yourself; no NaN is left after the ffill, so the output is unchanged
print(df)

     0    1    2
0  1.0  2.0  3.0
1  4.0  NaN  NaN
2  NaN  NaN  9.0
-------------------
     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0
-------------------
     0    1    2
0  1.0  2.0  3.0
1  4.0  2.0  3.0
2  4.0  2.0  9.0
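
Because the cell above applies the backward fill to a frame that has already been forward filled, it cannot show what a backward fill does on its own. The sketch below runs both fills separately on a fresh copy of the same dummy data, using the `ffill()` and `bfill()` methods instead of `fillna(method=...)`.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])

forward = df.ffill()   # each NaN takes the value from the row above
backward = df.bfill()  # each NaN takes the value from the row below

print(forward)
print(backward)
```

In the backward fill, a NaN in the last row stays NaN because there is no later row to copy from.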


Oftentimes data will come to us with column names, index names, or other naming conventions that we are not satisfied with. Use **`rename()`**, which lets you change `index` names and/or `column` names.

<b><u>RENAME SPECIFIC COLUMNS</u></b>

Use **`df.rename()`** and list the columns to be renamed. Columns that are not listed keep their names:

<pre>
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})

### Or rename the existing DataFrame in place (rather than creating a copy)
df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
</pre>

In [None]:
df = pd.DataFrame('suven', index=range(3), columns=list('abcde'))
df

       a      b      c      d      e
0  suven  suven  suven  suven  suven
1  suven  suven  suven  suven  suven
2  suven  suven  suven  suven  suven


In [None]:
## The following methods all work and produce the same output:
df2 = df.rename({'a': 'X', 'b': 'Y'}, axis=1)  # mapping plus axis number

## df2 = df.rename({'a': 'X', 'b': 'Y'}, axis='columns')  # mapping plus axis name

## df2 = df.rename(columns={'a': 'X', 'b': 'Y'})  # columns= keyword, equally valid

df2

## Note : for inplace changes =>  df.rename({'a': 'X', 'b': 'Y'}, axis=1, inplace=True)

       X      Y      c      d      e
0  suven  suven  suven  suven  suven
1  suven  suven  suven  suven  suven
2  suven  suven  suven  suven  suven
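
The examples above only rename columns with a dict. As a small sketch of the other two things `rename()` can do, the cell below passes a function instead of a dict and renames row (index) labels. The new labels `'first'` and `'last'` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame('suven', index=range(3), columns=list('abcde'))

# A callable is applied to every column label.
upper = df.rename(columns=str.upper)

# index= renames row labels; labels missing from the mapping are left alone.
rows = df.rename(index={0: 'first', 2: 'last'})

print(upper.columns.tolist())
print(rows.index.tolist())
```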


rename_axis() : Alter the name of the index or columns
--
The **rename_axis()** function is used to set the name of the axis for the index or columns.

In [None]:
## creating a dummy dataframe

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(df)

print("------------------------------------")

df.rename_axis("numbers", inplace=True)
print(df)

print("------------------------------------")

df.rename_axis("bar", axis="columns", inplace=True)
print(df)

   A  B
0  1  4
1  2  5
2  3  6
------------------------------------
         A  B
numbers      
0        1  4
1        2  5
2        3  6
------------------------------------
bar      A  B
numbers      
0        1  4
1        2  5
2        3  6


<font color='red'> Caution : rename_axis() itself is not deprecated. Only the old usage of passing a mapping or function to change the labels was deprecated; use rename() to change labels. </font>

**Combining**
--

When performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways. Pandas has three core methods for doing this. In order of increasing complexity, these are **concat(), join(), and merge()**. Most of what merge() can do can also be done more simply with join(), so we will omit it and focus on the first two functions here.

The simplest combining method is **concat()**. Given a list of elements, this function will smush those elements together along an axis.

In [None]:
## Combine two DataFrame objects with identical columns.

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
print(df1)

print("--------------------")

df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])
print(df2)

print("--------------------")

newdf1 = pd.concat([df1, df2])   ## default axis=0, i.e. rows are stacked vertically
## .reset_index() renumbers the rows from 0 and keeps the old labels in an 'index' column;
## pass drop=True to discard them
newdf1 = pd.concat([df1, df2]).reset_index()
print(newdf1)

print("--------------------")

newdf2 = pd.concat([df1, df2], axis=1)   ## axis=1 places the DataFrames side by side, aligned on the index
print(newdf2)

  letter  number
0      a       1
1      b       2
--------------------
  letter  number
0      c       3
1      d       4
--------------------
   index letter  number
0      0      a       1
1      1      b       2
2      0      c       3
3      1      d       4
--------------------
  letter  number letter  number
0      a       1      c       3
1      b       2      d       4
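
When stacking rows, the repeated labels 0, 1, 0, 1 are often unwanted. As a sketch of two alternatives to `reset_index()`, the cell below renumbers the rows with `ignore_index=True` and, separately, tags each source frame with `keys=`. The key names `'first'` and `'second'` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])

# Renumber the rows 0..3 without keeping the old labels as a column.
stacked = pd.concat([df1, df2], ignore_index=True)

# Keep the old labels and add an outer index level naming the source frame.
labelled = pd.concat([df1, df2], keys=['first', 'second'])

print(stacked)
print(labelled)
```

With `keys=`, `labelled.loc['second']` returns just the rows that came from `df2`.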


Joining on index
--

**DataFrame.join()** is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. Here is a **`very basic example`**:

In [None]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}, index=['K0', 'K1', 'K2'])
print(left)

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'], 'D': ['D0', 'D2', 'D3']}, index=['K0', 'K2', 'K3'])
print(right)

print("----default is left Join clause--------")
result = left.join(right)  ## result = left.join(right, how='left')
print(result)
print("----------try how='inner' join --------")
result = left.join(right, how='inner')   ## only index labels present in both
print(result)
print("----------try how='outer' --------")
result = left.join(right, how='outer')   ## union of the index labels from both
print(result)
print("----------try how='right' --------")
result = left.join(right, how='right')   ## uses the index labels of the right table
print(result)

     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3
----default is left Join clause--------
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
----------try how='inner' join --------
     A   B   C   D
K0  A0  B0  C0  D0
K2  A2  B2  C2  D2
----------try how='outer' --------
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
----------try how='right' --------
      A    B   C   D
K0   A0   B0  C0  D0
K2   A2   B2  C2  D2
K3  NaN  NaN  C3  D3
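
Since merge() is mentioned above but not demonstrated, here is a minimal sketch of the same index-based outer join written with `pd.merge()`. It assumes the same `left` and `right` frames as the cell above and compares the two results.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'], 'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

via_join = left.join(right, how='outer')

# merge() needs to be told explicitly to match on the index of both frames.
via_merge = pd.merge(left, right, left_index=True, right_index=True, how='outer')

print(via_merge)
print(via_join.equals(via_merge))
```

join() is shorter when matching on the index; merge() becomes the more natural choice when the match is on ordinary columns.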


In [None]:
## What if the DataFrames share some column names?
## The lsuffix and rsuffix parameters are needed here because column 'A' exists in both.
leftCommon = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}, index=['K0', 'K1', 'K2'])
print(leftCommon)

rightCommon = pd.DataFrame({'A': ['A4', 'A5', 'A6'], 'D': ['D0', 'D2', 'D3']}, index=['K0', 'K2', 'K3'])
print(rightCommon)

resultCommon = leftCommon.join(rightCommon, lsuffix='_Left', rsuffix='_Rgt')
print(resultCommon)

## You can try it without the lsuffix='_Left', rsuffix='_Rgt' parameters; pandas raises a ValueError
## resultCommon_1 = leftCommon.join(rightCommon)
## print(resultCommon_1)

     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2
     A   D
K0  A4  D0
K2  A5  D2
K3  A6  D3
   A_Left   B A_Rgt    D
K0     A0  B0    A4   D0
K1     A1  B1   NaN  NaN
K2     A2  B2    A5   D2
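
To see what happens when the suffixes are left out, the sketch below repeats the join without them and catches the resulting error so the cell still runs to the end. The variable name `message` is only for illustration.

```python
import pandas as pd

leftCommon = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']},
                          index=['K0', 'K1', 'K2'])
rightCommon = pd.DataFrame({'A': ['A4', 'A5', 'A6'], 'D': ['D0', 'D2', 'D3']},
                           index=['K0', 'K2', 'K3'])

try:
    leftCommon.join(rightCommon)   # column 'A' exists in both frames
    message = None
except ValueError as err:
    message = str(err)             # pandas refuses to guess the suffixes

print(message)
```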


<h3> <font color='red'> Question Time ( Solve all Qns in 15 mins )</font></h3>
<hr>

1. I have three dataframes with shapes (2656, 246), (2656, 2412) and (2656, 7025). I want to combine them into a single (2656, 9683) DataFrame. How?

<pre> Hint : 246+2412+7025 = 9683 </pre>

**Ans to Q1**

<pre>
Assuming the rows of all three dataframes share the same index and are in the same order,
use <b>concat</b> with axis=1 to place them side by side:

new_df = pd.concat([df1, df2, df3], axis=1)

Note that <b>ignore_index=True</b> on its own does not do this. Without axis=1 it stacks the
dataframes row-wise and renumbers the rows, giving 7968 rows instead of 2656:

new_df = pd.concat([df1, df2, df3], ignore_index=True)

If the row indexes differ and you want to combine the rows by position,
call reset_index(drop=True) on each dataframe before using concat with axis=1.
</pre>

**Each Student is expected to create some dummy df's with (2,3), (2,4), (2,2) dimensions respectively and check my answers** 

In [None]:
## Space to try out the above : new_df = pd.concat([df1, df2, df3], ignore_index=True)
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1'], 'C': ['C0', 'C1']}, index=['0', '1'])
print(df1)
print('-----------------------')
df2 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1'], 'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=['0', '1'])
print(df2)
print('-----------------------')
df3 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=['0', '1'])
print(df3)
print('-----------------------')
print(pd.concat([df1, df2, df3], axis=1))
print('-----------------------')
new_df = pd.concat([df1, df2, df3], ignore_index=True)
print(new_df)

    A   B   C
0  A0  B0  C0
1  A1  B1  C1
-----------------------
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
-----------------------
    A   B
0  A0  B0
1  A1  B1
-----------------------
    A   B   C   A   B   C   D   A   B
0  A0  B0  C0  A0  B0  C0  D0  A0  B0
1  A1  B1  C1  A1  B1  C1  D1  A1  B1
-----------------------
    A   B    C    D
0  A0  B0   C0  NaN
1  A1  B1   C1  NaN
2  A0  B0   C0   D0
3  A1  B1   C1   D1
4  A0  B0  NaN  NaN
5  A1  B1  NaN  NaN
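
The output above shows that `ignore_index=True` without `axis=1` stacks rows. For the case where the row labels of the frames differ but the rows should be paired by position, one possible approach is sketched below. The frames `a` and `b` and their labels are hypothetical.

```python
import pandas as pd

# Hypothetical frames whose row labels do not match.
a = pd.DataFrame({'A': ['A0', 'A1']}, index=['x', 'y'])
b = pd.DataFrame({'B': ['B0', 'B1']}, index=['p', 'q'])

# concat aligns on labels, so mismatched labels give 4 rows with NaN gaps.
misaligned = pd.concat([a, b], axis=1)

# Dropping the labels first makes concat pair the rows by position.
by_position = pd.concat(
    [a.reset_index(drop=True), b.reset_index(drop=True)], axis=1
)

print(misaligned.shape, by_position.shape)
```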


2. Given this df :
<pre>
df = pd.DataFrame({'age':[24, 20, 19],
                       'color':['blue', 'green', 'red'],
                       'food':['Dal-Khichdi', 'Chicken', 'Eggs'],
                       'height':[165, 170, 120],
                       'score':[8.6, 8.3, 9.0],
                       'state':['NAGPUR', 'CANADA', 'MUMBAI']},
                       index = ['Ankita', 'Hammad', 'Aden']) </pre>

<font color='darkgreen'><b>Get me this o/p :</b></font>

<img src= 'https://drive.google.com/uc?id=1YDBcasiBAop-0kiFuaBVmYk1cN_9Lzgy' />

In [None]:
## creating the above dummy dataframe
df = pd.DataFrame({'age':[24, 20, 19],
                    'color':['blue', 'green', 'red'],
                    'food':['Dal-Khichdi', 'Chicken', 'Eggs'],
                    'height':[165, 170, 120],
                    'score':[8.6, 8.3, 9.0],
                    'state':['MH', 'Alberta', 'MH']},
                    index = ['Ankita', 'Hammad', 'Aden']) 

## Hint : use rename_axis twice

df.rename_axis("axis1", inplace=True)
print(df)

print("------------------------------------")

df.rename_axis("axis2", axis="columns", inplace=True)
print(df)

        age  color         food  height  score    state
axis1                                                  
Ankita   24   blue  Dal-Khichdi     165    8.6       MH
Hammad   20  green      Chicken     170    8.3  Alberta
Aden     19    red         Eggs     120    9.0       MH
------------------------------------
axis2   age  color         food  height  score    state
axis1                                                  
Ankita   24   blue  Dal-Khichdi     165    8.6       MH
Hammad   20  green      Chicken     170    8.3  Alberta
Aden     19    red         Eggs     120    9.0       MH


3. Reuse the above df and **get me the following o/p**:

<img src='https://drive.google.com/uc?id=1rV0wf9CXLtr6ytx5ql5tv-Vkejitqx3l' />


In [None]:
## Note (from the o/p image) that the index has been changed.
## Previously the names were the index. We are replacing it with 2 index levels.
## A DataFrame indexed by two or more levels is said to have a multi-index.
## Hint : use set_index()
df = pd.DataFrame({'age':[24, 20, 19],
                    'color':['blue', 'green', 'red'],
                    'food':['Dal-Khichdi', 'Chicken', 'Eggs'],
                    'height':[165, 170, 120],
                    'score':[8.6, 8.3, 9.0],
                    'state':['MH', 'Alberta', 'MH']},
                    index = ['Ankita', 'Hammad', 'Aden'])
df1 = df.set_index(['state', 'color'])
df1

               age         food  height  score
state   color
MH      blue    24  Dal-Khichdi     165    8.6
Alberta green   20      Chicken     170    8.3
MH      red     19         Eggs     120    9.0


4. Reuse the above original df and **get me the following o/p**:

<img src='https://drive.google.com/uc?id=1it8aUlrU8yebTYsNnAlLw0-iksCF1VKq' />

In [None]:
## Notice that we are setting multiple index levels, as in Q3.
## But in the above o/p we also want to keep the original index.
## Hint : use the append=True option in set_index()

df = pd.DataFrame({'age':[24, 20, 19],
                    'color':['blue', 'green', 'red'],
                    'food':['Dal-Khichdi', 'Chicken', 'Eggs'],
                    'height':[165, 170, 120],
                    'score':[8.6, 8.3, 9.0],
                    'state':['MH', 'Alberta', 'MH']},
                    index = ['Ankita', 'Hammad', 'Aden'])
df1 = df.set_index(['state', 'color'],append=True)
df1

                      age         food  height  score
       state   color
Ankita MH      blue    24  Dal-Khichdi     165    8.6
Hammad Alberta green   20      Chicken     170    8.3
Aden   MH      red     19         Eggs     120    9.0


5. Notice how the original index has **no name**. Can you use **rename_axis**, passing it a list with one name per index level, to **rename** all the index levels?

I **want some o/p** like this :

<img src = 'https://drive.google.com/uc?id=17AX_Iyb1aFETkhxZ_vowvw0H_wnNLAAP' />

In [None]:
## Remember : when renaming a multi-index, always pass a list with one entry per level.
## To leave an index level unnamed, use None
df = pd.DataFrame({'age':[24, 20, 19],
                    'color':['blue', 'green', 'red'],
                    'food':['Dal-Khichdi', 'Chicken', 'Eggs'],
                    'height':[165, 170, 120],
                    'score':[8.6, 8.3, 9.0],
                    'state':['MH', 'Alberta', 'MH']},
                    index = ['Ankita', 'Hammad', 'Aden'])
df = df.set_index(['state', 'color'],append=True)
df1 = df.rename_axis(["Names", None, "Colours"])
df1

                        age         food  height  score
Names          Colours
Ankita MH      blue      24  Dal-Khichdi     165    8.6
Hammad Alberta green     20      Chicken     170    8.3
Aden   MH      red       19         Eggs     120    9.0


<hr>

<small>The study content is prepared by Rocky Jagtiani ( https://linkedin.com/in/rocky-jagtiani-3b390649/) - <b>rocky@suvenconsultants.com</b> </small>

<img src ="https://drive.google.com/uc?id=1-y7gMwSV7--Bu6Y-piHkdhIrjEkeZFQW"  width = '150px' />



<small><b>Copying this material is prohibited and needs prior permission from the Author & the Management of https://www.suvenconsultants.com/  </b></small>

<hr>

Thank you for going through the Notebook. I am sure it was a fruitful learning experience. You too can earn your **`"Masters in Data Science"`** certification, followed by Internships and Placement calls. Do look at https://datascience.suvenconsultants.com for Online live classroom training programmes from <u>Rocky Sir & his team of data scientists</u>.

![CertificationPic_In_the_NB](https://drive.google.com/uc?id=1SB127ve_Cl21hQxuC2odVZHvjGFmu_G1)