In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../datasets/GDP-countries.csv")

In [3]:
df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


<br>

### Add new row(s) in DataFrame

1. appending new rows via a dictionary
2. using `loc`

In [4]:
## using .append

new_row = {"country" : "India",
           "year" : 2022,
           "population" : 1120376631,
           "continent" : "Asia",
           "life_exp" : 72.5,
           "gdp_cap" : 3058
          }


df = df.append(new_row, ignore_index=True)

df.tail()

  df = df.append(new_row, ignore_index=True)


Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1704,India,2022,1120376631,Asia,72.5,3058.0


Important aspects to note here :
    
- `ignore_index` is must to pass; otherwise error is thrown.
- a new implicit index, pertaining to new_row, gets created automatically in `df`
- `.append` entirely functionality creates a new DataFrame in the memory. So, assigning back to `df` would thus obtain the new dataframe.
- `NaN` value is chosen for missing column value.
- Generally, as intrinsic to dictionary nature, `new_row` won't store order of key-value pairs, so they can be specified in any order, perhaps opposite to the order of columns in `df` also.
- ⛔ **`append` is NOT PREFERRED, moreover it'll soon deprecate as per above Warning.**

In [5]:
## using .loc

df.loc[len(df.index)] = ["India", 2022, 1120376631, "Asia", 72.5, 3058]    ## list

df.loc[len(df.index)] = {"country" : "India",
                         "year" : 2022,
                         "population" : 1120376631,
                         "gdp_cap" : 3058
                        }                                                  ## dict

df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1704,India,2022,1120376631,Asia,72.5,3058.0
1705,India,2022,1120376631,Asia,72.5,3058.0
1706,India,2022,1120376631,,,3058.0


Important aspects to note here :

- `df.index` returns an iterable RangeIndex object, that has info about start and end indices of dataframe. This is iterable.
- ⛔ `.iloc` will result in ERROR for same task -> **iloc cannot enlarge its target object**, because end index would not be found to exist. So `loc` is preferred.
- `.loc` can accept a list or dictionary, which contain elements for new rows.
- `.loc` replaces the row elements with new values if that index already exists, otherwise adds a new row.
- For a list, values in list should match the corresponding dtype of columns. Length of length must be equal to total columns in `df`.

<br>

### Edit/Replace existing row(s) in DataFrame

- replace entire row(s). Use either `loc` or `iloc`.
- replace values of certain columns.

In [6]:
## edit 1 entire row

df.loc[1704] = ["India", "2021", 1110346000, "Asia", 68, 2500]

df.tail(4)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1704,India,2021,1110346000,Asia,68.0,2500.0
1705,India,2022,1120376631,Asia,72.5,3058.0
1706,India,2022,1120376631,,,3058.0


In [None]:
## edit multiple rows



In [7]:
df[(df["country"]=="India") & (df["year"]==2022)]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1705,India,2022,1120376631,Asia,72.5,3058.0
1706,India,2022,1120376631,,,3058.0


In [62]:
df.loc[1704]["year"] = 2021

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[1704]["year"] = 2021


In [None]:
for idx in df[(df["country"]=="India") & (df["year"]==2022)].index:
    

<br>

### Drop duplicate rows
- `.drop_duplicates()`


- `df.duplicated()` - returns a boolean mask Series denoting `True` for duplicate rows, and `False` for otherwise. By default, matches via "first" occurence.


- `.drop()` - drop certain row(s) by specifying their indices; default axis of .drop is, `axis=0`.

Lets add some duplicate rows to work upon the example.<br>
(`df.loc[len(df.index)] = ["India", 2021, 1110346000, "Asia", 68, 2500]`)

Then :

In [29]:
df.tail(15)

## many duplicate rows exist in df. We can drop these occurences.

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1704,India,2021,1110346000,Asia,68.0,2500.0
1705,India,2022,1120376631,Asia,72.5,3058.0
1706,India,2022,1120376631,,,3058.0
1707,India,2022,1120376631,Asia,72.5,3058.0
1708,India,2022,1120376631,Asia,72.5,3058.0
1709,India,2022,1120376631,Asia,72.5,3058.0


In [30]:
df.duplicated().tail(15)

1700    False
1701    False
1702    False
1703    False
1704    False
1705    False
1706    False
1707     True
1708     True
1709     True
1710    False
1711     True
1712     True
1713     True
1714     True
dtype: bool

<br>

Lets get all duplicated rows :

(taking help of `df.duplicated()`, which acts as boolean mask, to get all duplicate rows)

In [37]:
df[df.duplicated()]

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1707,India,2022,1120376631,Asia,72.5,3058.0
1708,India,2022,1120376631,Asia,72.5,3058.0
1709,India,2022,1120376631,Asia,72.5,3058.0
1711,India,2021,1110346000,Asia,68.0,2500.0
1712,India,2022,1120376631,Asia,72.5,3058.0
1713,India,2022,1120376631,Asia,72.5,3058.0
1714,India,2021,1110346000,Asia,68.0,2500.0


<br>

Now, drop duplicate rows :

- `keep` has 3 possible values =><br>
`first` - keeps first occurence out of each repeated sets<br>
`last` - converse to 'first', keeps last occurence<br>
`False` - deletes/drops all kinds of repeated occurences, and keeps nothing.


- this is a temporary change, so do `inplace=True` to preserve the new dataframe created after deletion of redundant rows.

In [41]:
df = df.drop_duplicates(keep="first")

df.tail(10)

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1698,Zimbabwe,1982,7636524,Africa,60.363,788.855041
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298
1704,India,2021,1110346000,Asia,68.0,2500.0
1705,India,2022,1120376631,Asia,72.5,3058.0
1706,India,2022,1120376631,,,3058.0
1710,India,2021,1110346000,Asia,68.0,2500.0


<br>

**Use of `subset` parameter :**

To drop the repeated rows based on certain columns, and keep either first, last or none of the repeated rows.

Here, let's consider for each country, only "first" occurence/row pertaining to each country would remain and rest would drop.<br>
This is like, only `df["country"].nunique` (i.e. 142) number of rows would remain after dropping. Meaning, only 1 ("first") entry of each country will now exist.

In [44]:
df.drop_duplicates(subset=["country"], keep="first")

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
0,Afghanistan,1952,8425333,Asia,28.801,779.445314
12,Albania,1952,1282697,Europe,55.230,1601.056136
24,Algeria,1952,9279525,Africa,43.077,2449.008185
36,Angola,1952,4232095,Africa,30.015,3520.610273
48,Argentina,1952,17876956,Americas,62.485,5911.315053
...,...,...,...,...,...,...
1644,Vietnam,1952,26246839,Asia,40.412,605.066492
1656,West Bank and Gaza,1952,1030585,Asia,43.160,1515.592329
1668,"Yemen, Rep.",1952,4963829,Asia,32.548,781.717576
1680,Zambia,1952,2672000,Africa,42.038,1147.388831


<br>

Drop row(s) via `.drop()` :

In [54]:
df.drop([1704,1705,1706,1710], axis=0, inplace=True)

In [55]:
df.tail()

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1699,Zimbabwe,1987,9216418,Africa,62.351,706.157306
1700,Zimbabwe,1992,10704340,Africa,60.377,693.420786
1701,Zimbabwe,1997,11404948,Africa,46.809,792.44996
1702,Zimbabwe,2002,11926563,Africa,39.989,672.038623
1703,Zimbabwe,2007,12311143,Africa,43.487,469.709298


### Some Aggregate functions on DataFrame/Series

In [13]:
df['life_exp'].agg(['mean', 'max', 'min', 'size', 'count'])

mean       59.474439
max        82.603000
min        23.599000
size     1704.000000
count    1704.000000
Name: life_exp, dtype: float64

In [14]:
df.agg(['mean', 'max', 'min', 'size', 'count'])

  df.agg(['mean', 'max', 'min', 'size', 'count'])


Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
mean,,1979.5,29601210.0,,59.474439,7215.327081
max,Zimbabwe,2007.0,1318683000.0,Oceania,82.603,113523.1329
min,Afghanistan,1952.0,60011.0,Africa,23.599,241.165876
size,1704,1704.0,1704.0,1704,1704.0,1704.0
count,1704,1704.0,1704.0,1704,1704.0,1704.0


<br>

### Sort column's values

In [65]:
## example :

# Ascending sort `year` column followed by doing a descending sort on `life_exp`.

df.sort_values(by=["year", "life_exp"], ascending=[True, False])

Unnamed: 0,country,year,population,continent,life_exp,gdp_cap
1140,Norway,1952,3327728,Europe,72.670,10095.421720
684,Iceland,1952,147962,Europe,72.490,7267.688428
1080,Netherlands,1952,10381988,Europe,72.130,8941.571858
1464,Sweden,1952,7124673,Europe,71.860,8527.844662
408,Denmark,1952,4334000,Europe,70.780,9692.385245
...,...,...,...,...,...,...
887,Lesotho,2007,2012649,Africa,42.592,1569.331442
1355,Sierra Leone,2007,6144562,Africa,42.568,862.540756
1691,Zambia,2007,11746035,Africa,42.384,1271.211593
1043,Mozambique,2007,19951656,Africa,42.082,823.685621


<hr>

## Concatenate

- `pd.concate()` => stacks multiple dataframes together along a particular axis.

By default, Concatenate is vertical operation i.e. `axis=0`.

When `axis=0` :
- corresponding columns are compared.
- vertical stacking occurs.
- Unknown entries get filled via `NaN` ie. of float64 dtype.

When `axis=1` :
- corresponding indeices are compared.
- horizontal stacking occurs.

⛔ _`.concate` is does not bear much relevance for its usage in industry, owing to its limitation that it generates NaN values, and etc reasons._

In [92]:
df_1 = pd.DataFrame([[10, 20], [30, 35], [25, 30]], columns=["B", "A"])

df_1

Unnamed: 0,B,A
0,10,20
1,30,35
2,25,30


In [93]:
df_2 = pd.DataFrame([[15, 40], [10, 5], [20, 50]],  columns=["A", "C"])

df_2

Unnamed: 0,A,C
0,15,40
1,10,5
2,20,50


In [102]:
pd.concat([df_1, df_2])

## same results with `axis=0`

Unnamed: 0,B,A,C
0,10.0,20,
1,30.0,35,
2,25.0,30,
0,,15,40.0
1,,10,5.0
2,,20,50.0


In [103]:
pd.concat([df_1, df_2]).loc[2]

Unnamed: 0,B,A,C
2,25.0,30,
2,,20,50.0


In [96]:
pd.concat([df_1, df_2], ignore_index=True)

Unnamed: 0,B,A,C
0,10.0,20,
1,30.0,35,
2,25.0,30,
3,,15,40.0
4,,10,5.0
5,,20,50.0


In [98]:
pd.concat([df_1, df_2], axis=1)

Unnamed: 0,B,A,A.1,C
0,10,20,15,40
1,30,35,10,5
2,25,30,20,50


In [112]:
pd.concat([df_1, df_2], keys=["df_1", "df_2"])

## example to generate multi level indexing

Unnamed: 0,Unnamed: 1,B,A,C
df_1,0,10.0,20,
df_1,1,30.0,35,
df_1,2,25.0,30,
df_2,0,,15,40.0
df_2,1,,10,5.0
df_2,2,,20,50.0


In [113]:
pd.concat([df_1, df_2], keys=["df_1", "df_2"], axis=1)

Unnamed: 0_level_0,df_1,df_1,df_2,df_2
Unnamed: 0_level_1,B,A,A,C
0,10,20,15,40
1,30,35,10,5
2,25,30,20,50


### Concatenation using df.append()

There also exists a shorter method of appending 1 dataframe to the other.

This is through the `append()` method.

In this, concatenation takes place only along `axis = 0`.

In [15]:
df_1 = pd.DataFrame([[10, 20], [30, 35], [25, 30]], columns=["B", "A"])
df_2 = pd.DataFrame([[15, 40], [10, 5], [20, 50]],  columns=["A", "C"])

df_1.append(df_2, ignore_index=True)

  df_1.append(df_2, ignore_index=True)


Unnamed: 0,B,A,C
0,10.0,20,
1,30.0,35,
2,25.0,30,
3,,15,40.0
4,,10,5.0
5,,20,50.0


How is it different from `pd.concat()` ?

- The append() method does not modify the orginial object


- It creates a new one with combined data


- Only works along axis = 0


- Can be used to concatenate only 2 dataframes/series at a time


**Hence, it is NOT a very efficient and preferred method**

<br>

## Merge

- `.merge()` => Combines dataframes side-by-side based on the shared column.


## Joins in Merge

#### 1. One-to-One join
  - Similar to concatenating columns

In [2]:
df1 = pd.DataFrame({'employee': ['Ved', 'Riya', 'Jahnvi', 'Umang'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df2 = pd.DataFrame({'employee': ['Jahnvi', 'Ved', 'Riya', 'Umang'],
                    'hire_date': [2014, 2017, 2018, 2021]})
print(df1)
print("-" * 10)
print(df2)

  employee        group
0      Ved   Accounting
1     Riya  Engineering
2   Jahnvi  Engineering
3    Umang           HR
----------
  employee  hire_date
0   Jahnvi       2014
1      Ved       2017
2     Riya       2018
3    Umang       2021


In [3]:
df3 = pd.merge(df1, df2)

df3

Unnamed: 0,employee,group,hire_date
0,Ved,Accounting,2017
1,Riya,Engineering,2018
2,Jahnvi,Engineering,2014
3,Umang,HR,2021


- The `pd.merge()` function recognizes that each DataFrame has an "employee" column, and automatically joins using this column as a key.


- The result of the merge is a new DataFrame that combines the information from the two inputs.


- The order of entries in each column is not necessarily maintained.<br>
In this case, the order of the "employee" column differs between `df1` and `df2`, and the `pd.merge()` function correctly accounts for this.


- Additionally, keep in mind that the merge in general discards the index, except in the special case of merges by index

#### 2. Many to one join

  - One of the two key columns contains duplicate entries.
  - For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate.


In [4]:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Steve', 'Satya', 'Sundar']})

df4

Unnamed: 0,group,supervisor
0,Accounting,Steve
1,Engineering,Satya
2,HR,Sundar


In [5]:
pd.merge(df3, df4)

Unnamed: 0,employee,group,hire_date,supervisor
0,Ved,Accounting,2017,Steve
1,Riya,Engineering,2018,Satya
2,Jahnvi,Engineering,2014,Satya
3,Umang,HR,2021,Sundar


#### 3. Many to Many joins

  -  If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge.

Lets look at an example to understand this better :

In [6]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})

df5

Unnamed: 0,group,skills
0,Accounting,math
1,Accounting,spreadsheets
2,Engineering,coding
3,Engineering,linux
4,HR,spreadsheets
5,HR,organization


In [7]:
pd.merge(df1, df5)

Unnamed: 0,employee,group,skills
0,Ved,Accounting,math
1,Ved,Accounting,spreadsheets
2,Riya,Engineering,coding
3,Riya,Engineering,linux
4,Jahnvi,Engineering,coding
5,Jahnvi,Engineering,linux
6,Umang,HR,spreadsheets
7,Umang,HR,organization


<br>

### <u>Understanding the concept of "Merge"</u>

In [10]:
names  = pd.DataFrame({"reg_no":[101, 102, 103, 104],
                       "name":["Naman", "Shubham", "Aditi", "Ved"]
                      })

skills = pd.DataFrame({"reg_no":[101, 102, 102, 103],
                       "skill":["linux", "web dev", "data mining", "DB admin"]
                      })

print(names)
print("-" * 20)
print(skills)

   reg_no     name
0     101    Naman
1     102  Shubham
2     103    Aditi
3     104      Ved
--------------------
   reg_no        skill
0     101        linux
1     102      web dev
2     102  data mining
3     103     DB admin


In [11]:
names.merge(skills, on="reg_no")

## by default, this results in "inner" join

Unnamed: 0,reg_no,name,skill
0,101,Naman,linux
1,102,Shubham,web dev
2,102,Shubham,data mining
3,103,Aditi,DB admin


If the name of columns of each dataframes, upon which the merge is happening, are different. However their values are same. Then :

In [12]:
names.rename(columns={"reg_no":"id"}, inplace=True)

In [13]:
print(names)
print("-" * 20)
print(skills)

    id     name
0  101    Naman
1  102  Shubham
2  103    Aditi
3  104      Ved
--------------------
   reg_no        skill
0     101        linux
1     102      web dev
2     102  data mining
3     103     DB admin


In [14]:
names.merge(skills, left_on="id", right_on="reg_no")

## this also results in "inner" join, by default.

Unnamed: 0,id,name,reg_no,skill
0,101,Naman,101,linux
1,102,Shubham,102,web dev
2,102,Shubham,102,data mining
3,103,Aditi,103,DB admin


- Specify the column names on which merge would happen. `left_on` and `right_on` parameters correspond to left and right dataframes, respectively.

However, both joining-columns would now appear which looks redundant.

In [15]:
## outer join

names.merge(skills, left_on="id", right_on="reg_no", how="outer")

Unnamed: 0,id,name,reg_no,skill
0,101,Naman,101.0,linux
1,102,Shubham,102.0,web dev
2,102,Shubham,102.0,data mining
3,103,Aditi,103.0,DB admin
4,104,Ved,,


<hr>

## Dealing with an IMDB dataset

In [8]:
movies    = pd.read_csv("../datasets/movies_dataset/movies.csv", index_col=0)
directors = pd.read_csv("../datasets/movies_dataset/directors.csv", index_col=0)

In [9]:
movies.head()

Unnamed: 0,id,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day
0,43597,237000000,150,2787965087,Avatar,7.2,11800,4762,2009,Dec,Thursday
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,4763,2007,May,Saturday
2,43599,245000000,107,880674609,Spectre,6.3,4466,4764,2015,Oct,Monday
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,4765,2012,Jul,Monday
5,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,4767,2007,May,Tuesday


In [10]:
directors.head()

Unnamed: 0,director_name,id,gender
0,James Cameron,4762,Male
1,Gore Verbinski,4763,Male
2,Sam Mendes,4764,Male
3,Christopher Nolan,4765,Male
4,Andrew Stanton,4766,Male


For absent column name/label, Pandas would itself name such column as `Undefined`. As it may seem redundant, such columns can be dropped.

Here in both the datasets, we can also turn any column to be used as the implicit index column -> via the `index_col` parameter.<br>
Specify the implicit index of that particular column (here, `0`), which you want to turn into the index column.

In [25]:
movies.shape

(1465, 11)

In [26]:
directors.shape

(2349, 3)

- Check `df.info()` to get an idea of distribution of missing values.
- It gives no. of non-null (i.e, Available) values in each column.

In [27]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1465 entries, 0 to 4768
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            1465 non-null   int64  
 1   budget        1465 non-null   int64  
 2   popularity    1465 non-null   int64  
 3   revenue       1465 non-null   int64  
 4   title         1465 non-null   object 
 5   vote_average  1465 non-null   float64
 6   vote_count    1465 non-null   int64  
 7   director_id   1465 non-null   int64  
 8   year          1465 non-null   int64  
 9   month         1465 non-null   object 
 10  day           1465 non-null   object 
dtypes: float64(1), int64(7), object(3)
memory usage: 137.3+ KB


In [28]:
movies.describe()

Unnamed: 0,id,budget,popularity,revenue,vote_average,vote_count,director_id,year
count,1465.0,1465.0,1465.0,1465.0,1465.0,1465.0,1465.0,1465.0
mean,45225.191126,48022950.0,30.855973,143253900.0,6.368191,1146.396587,5040.192491,2002.615017
std,1189.096396,49355410.0,34.845214,206491800.0,0.818033,1578.077438,258.059631,8.680141
min,43597.0,0.0,0.0,0.0,3.0,1.0,4762.0,1976.0
25%,44236.0,14000000.0,11.0,17380130.0,5.9,216.0,4845.0,1998.0
50%,45022.0,33000000.0,23.0,75781640.0,6.4,571.0,4964.0,2004.0
75%,45990.0,66000000.0,41.0,179246900.0,6.9,1387.0,5179.0,2009.0
max,48395.0,380000000.0,724.0,2787965000.0,8.3,13752.0,6204.0,2016.0


In [29]:
directors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2349 entries, 0 to 2348
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   director_name  2349 non-null   object
 1   id             2349 non-null   int64 
 2   gender         1724 non-null   object
dtypes: int64(1), object(2)
memory usage: 73.4+ KB


In [30]:
directors.describe()

Unnamed: 0,id
count,2349.0
mean,5936.0
std,678.242213
min,4762.0
25%,5349.0
50%,5936.0
75%,6523.0
max,7110.0


__Scenario 1 :__

- Some of the values are missing - `NaN` in gender.

But other than that is there any other missing values present ?<br>
Lets check for this -

In [32]:
movies.loc[(movies['budget'] == 0.0) | (movies['revenue'] == 0.0)]

Unnamed: 0,id,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day
321,43918,0,16,104907746,The Campaign,5.6,578,4948,2012,Aug,Thursday
453,44050,0,25,0,The Pink Panther,5.6,550,4842,2006,Jan,Wednesday
475,44072,0,20,43312294,The Edge,6.7,349,4859,1997,Sep,Saturday
536,44133,75000000,13,0,Anna and the King,6.4,197,5039,1999,Dec,Thursday
584,44181,0,13,0,Wolf,6.0,216,5031,1994,Jun,Friday
...,...,...,...,...,...,...,...,...,...,...,...
4696,48323,0,5,0,The Mighty,7.1,51,4921,1998,Oct,Friday
4732,48359,0,2,0,George Washington,6.4,36,5231,2000,Oct,Sunday
4736,48363,0,3,321952,The Last Waltz,7.9,64,4809,1978,May,Monday
4748,48375,0,7,0,Rampage,6.0,131,5148,2009,Aug,Friday


Usually, Missing value in data can appear in 3 ways :

- `NaN`
- `NAN`
- `nan`

### But that does not always hold true !!

As you can see, movies with zero budget or zero revenue is not possible, that are nothing but the __missing values__ here.

So you have to be intuitive to decide for missing values in the dataset.

Let's convert all missing values in columns to `NaN` for better understanding :

In [33]:
movies['revenue'] = movies['revenue'].replace(0, np.nan)
movies['budget']  = movies['budget'].replace(0, np.nan)

In [34]:
## hence if we check now again,

movies.loc[(movies['budget'] == 0.0) | (movies['revenue'] == 0.0)]

Unnamed: 0,id,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day


__Why do we need to handle these missing values?__
- Because most ML and DS algorithms break when they encounter missing data
- Missing data depreciate the performance of our models

<br>

__Question: Now, How can we deal with missing values? What ideas do you have?__
- It is on __case-to-case basis.__
- We have to __pick a method based on the dataset and SITUATION.__
- We have to __check what will work and what not.__

<br>

__Ask yourself: What makes sense and what not?__
- __DON'T just remove OR replace with default value OR replace with mean OR anything else blindly.__
- The way we choose to deal with missing values can be __easily misleading.__

<br>

So, we need to be very careful about how we choose to deal with missing data :
- Use something simple.
- But it should make sense in the given situation.

<br>

___There are many ways to deal with missing values !!___

<hr>

An interesting fact about `NaN` is :

In [35]:
from numpy import NaN, NAN, nan

In [36]:
nan == nan

False

__Why did it come out as `False`?__

_because, Can you compare two infinite or non-existent values?_
- NO !

### So, be careful while searching for missing values using `==`

<br>

__Next, so, How can we find if a value is missing or not?__
- use the Pandas in-built function - `isnull()`

In [39]:
np.nan is nan

True

In [40]:
nan is nan

True

In [41]:
NaN is nan

True

In [44]:
pd.isnull(nan), pd.isnull(NaN), pd.isnull(np.NaN)

(True, True, True)

<br>

__Scenario 2 :__

A situation can be where we need to check, if any director mentioned in `movies` is/are actually absent in the original `directors` dataframe.

Let's check this, via buitlin `isin` function :

(esentially we check that, do all 1465 directors in `movie` exist under 2349 `directors`.)

In [45]:
movies["director_id"].isin(directors["id"])

0       True
1       True
2       True
3       True
5       True
        ... 
4736    True
4743    True
4748    True
4749    True
4768    True
Name: director_id, Length: 1465, dtype: bool

In [46]:
## basically we should check this inside `np.all` to get any False value.

np.all(movies["director_id"].isin(directors["id"]))

True

This means, all the movie directors of `movies` do legitly exist i.e. are part of the super-set dataframe - `directors`

<br>

__Scenario 3:__

Merge 2 dataframes on common columns among `movies` and `directors`.

In [47]:
movies.merge(directors, left_on="director_id", right_on="id", how="left")

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day,director_name,id_y,gender
0,43597,237000000.0,150,2.787965e+09,Avatar,7.2,11800,4762,2009,Dec,Thursday,James Cameron,4762,Male
1,43598,300000000.0,139,9.610000e+08,Pirates of the Caribbean: At World's End,6.9,4500,4763,2007,May,Saturday,Gore Verbinski,4763,Male
2,43599,245000000.0,107,8.806746e+08,Spectre,6.3,4466,4764,2015,Oct,Monday,Sam Mendes,4764,Male
3,43600,250000000.0,112,1.084939e+09,The Dark Knight Rises,7.6,9106,4765,2012,Jul,Monday,Christopher Nolan,4765,Male
4,43602,258000000.0,115,8.908716e+08,Spider-Man 3,5.9,3576,4767,2007,May,Tuesday,Sam Raimi,4767,Male
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1460,48363,,3,3.219520e+05,The Last Waltz,7.9,64,4809,1978,May,Monday,Martin Scorsese,4809,Male
1461,48370,27000.0,19,3.151130e+06,Clerks,7.4,755,5369,1994,Sep,Tuesday,Kevin Smith,5369,Male
1462,48375,,7,,Rampage,6.0,131,5148,2009,Aug,Friday,Uwe Boll,5148,Male
1463,48376,,3,,Slacker,6.4,77,5535,1990,Jul,Friday,Richard Linklater,5535,Male


Notice here an interesting thing :

__From where in the final output merged dataframe has `id_x` and `id_y` columns come up ?__
Actually what happens is,

`id_y` (originally, `id`) => comes from `directors`

`id_x` (originally, `id`) => comes from `movies`

So, in order to eliminate the confusion of duplicate column names; Pandas itself edits such column occurrences via certain suffixes.

- `_x` => came via 1st dataframe, which is mentioned in .merge

- `_y` => means, it came via 2nd dataframe, which is mentined in .merge

In the final merged df, `id_y` and `director_id` are exactly same (because joining happens on that, so anyone can be dropped.)

__Scenario 4:__

How to find the total number of NULL values in each column ?

In [13]:
movies.isnull().sum()

id              0
budget          0
popularity      0
revenue         0
title           0
vote_average    0
vote_count      0
director_id     0
year            0
month           0
day             0
dtype: int64

In [14]:
directors.isnull().sum()

director_name      0
id                 0
gender           625
dtype: int64

In [16]:
directors[directors["gender"].isnull()].head()

## so, these many directors have missing their "gender" column value. In all, 625 missings.

Unnamed: 0,director_name,id,gender
18,Chris Weitz,4780,
40,Justin Lin,4802,
49,David Ayer,4811,
52,Kevin Reynolds,4814,
55,Robert Stromberg,4817,


If you want to check for the percenatge of missing value for each column or features ?

In [18]:
# len(movies) represents total number of rows
mis_val_per_col_movies = movies.isnull().sum() / len(movies)*100

mis_val_per_col_movies

id              0.0
budget          0.0
popularity      0.0
revenue         0.0
title           0.0
vote_average    0.0
vote_count      0.0
director_id     0.0
year            0.0
month           0.0
day             0.0
dtype: float64

In [20]:
mis_val_per_col_directors = directors.isnull().sum() / len(directors)*100

mis_val_per_col_directors

director_name     0.000000
id                0.000000
gender           26.607067
dtype: float64

#### How to get the frequency of null values using `unique()`

- We can get __frequency of each value__ in a column


- __No. of occurrences of each value__ in a column


- Using `value_counts()`

In [21]:
directors["gender"].value_counts()

Male      1574
Female     150
Name: gender, dtype: int64

__It tells that :__

- `Male` occurs 1309 times

- `Female` occurs 32 times

### But its not telling the count of missing values !

- Because by **default**, the **parameter `dropna` is set to `True`**


- `dropna=True` means it is **NOT going to count missing values**


- So, **we have to set `dropna=False`** to get the **count of missing values in a column**

In [23]:
directors['gender'].value_counts(dropna=False)

Male      1574
NaN        625
Female     150
Name: gender, dtype: int64

<br>

__Scenario 5:__

**Handling missing data** :

After finding the missing values, it's time to treat them.

_There are many ways to handle missing values in the dataset._


### Removing/Dropping the missing values

#### What if you have just 1 or very few rows which has missing data, compared to whole data?

- **Example**: only 10 rows out of 1 million rows having missing values


- We can simply remove those rows or columns using **`drop()`** 





#### How to find missing / None i.e. `null` / `NaN` valued data ?

In [15]:
df2 = pd.DataFrame([[np.nan, 2, None, 0],
                    [3, 4, np.nan, 1],
                    [np.nan, np.NAN, np.nan],
                    [np.NaN, 3, 2, np.nan]
                   ],
                   columns=["A","B","C","D"])

df2

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
1,3.0,4.0,,1.0
2,,,,
3,,3.0,2.0,


In [16]:
df2.isnull()

Unnamed: 0,A,B,C,D
0,True,False,True,False
1,False,False,True,False
2,True,True,True,True
3,True,False,False,True


In [17]:
df2.isna()

Unnamed: 0,A,B,C,D
0,True,False,True,False
1,False,False,True,False
2,True,True,True,True
3,True,False,False,True


In [18]:
df2.isna().sum()


## or, df2.isnull().sum()

A    3
B    1
C    3
D    2
dtype: int64

In [19]:
df2.isna().sum(axis=1)


## or, df2.isnull().sum(axis=1)

0    2
1    1
2    4
3    2
dtype: int64

#### Backfill `nan` values :

In [20]:
df2.fillna(0)

## fill entire `nan` values of dataframe with one single value (lets say integer 0)

Unnamed: 0,A,B,C,D
0,0.0,2.0,0.0,0.0
1,3.0,4.0,0.0,1.0
2,0.0,0.0,0.0,0.0
3,0.0,3.0,2.0,0.0


In [23]:
## fill only second column - B => access first column -> fill its missing values

df2['B'].fillna(0)

0    2.0
1    4.0
2    0.0
3    3.0
Name: B, dtype: float64

In [25]:
## fill `nan` values in C column, with mean of D column

df2["C"].fillna(df2["D"].mean())

0    0.5
1    0.5
2    0.5
3    2.0
Name: C, dtype: float64

#### - Backward & Forward filling

In [26]:
df2

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
1,3.0,4.0,,1.0
2,,,,
3,,3.0,2.0,


In [27]:
df2.fillna(method="ffill")

Unnamed: 0,A,B,C,D
0,,2.0,,0.0
1,3.0,4.0,,1.0
2,3.0,4.0,,1.0
3,3.0,3.0,2.0,1.0


In [29]:
df2.fillna(method="bfill")

Unnamed: 0,A,B,C,D
0,3.0,2.0,2.0,0.0
1,3.0,4.0,2.0,1.0
2,,3.0,2.0,
3,,3.0,2.0,


<br>

__Scenario 6:__

Replacing missing values with mean.

- use `any()`

#### How to filter rows that contains atleast one missing value ?

We can do it using `any()` method :
 
 - The any() function is used to check whether any element is `True`, potentially over an axis.

In [44]:
movies_mis_data = pd.read_csv("../datasets/movies_dataset/movies_mis_data.csv", index_col=0)

movies_mis_data

Unnamed: 0,id,popularity,title,vote_average,vote_count,year,month,day,director_name,gender,revenue_Mdollars,budget_Mdollars
0,43597,150,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male,2787.97,237.00
1,43598,139,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male,961.00,300.00
2,43599,107,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male,880.67,245.00
3,43600,112,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male,1084.94,250.00
4,43602,115,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male,890.87,258.00
...,...,...,...,...,...,...,...,...,...,...,...,...
1460,48363,3,The Last Waltz,7.9,64,1978,May,Monday,Martin Scorsese,Male,0.32,0.00
1461,48370,19,Clerks,7.4,755,1994,Sep,Tuesday,Kevin Smith,Male,3.15,0.03
1462,48375,7,Rampage,6.0,131,2009,Aug,Friday,Uwe Boll,Male,0.00,0.00
1463,48376,3,Slacker,6.4,77,1990,Jul,Friday,Richard Linklater,Male,0.00,0.00


__What sort of erroneous data can be found in a movies dataset ? :__

This should sound misleading that a movie's budget or revenue was $ 0.0 . Hence such records should be replace with `NaN`.

In [45]:
movies_mis_data.loc[(movies_mis_data['revenue_Mdollars'] == 0.0) | (movies_mis_data['budget_Mdollars'] == 0.0)]

Unnamed: 0,id,popularity,title,vote_average,vote_count,year,month,day,director_name,gender,revenue_Mdollars,budget_Mdollars
195,43918,16,The Campaign,5.6,578,2012,Aug,Thursday,Jay Roach,Male,104.91,0.0
271,44050,25,The Pink Panther,5.6,550,2006,Jan,Wednesday,Shawn Levy,Male,0.00,0.0
281,44072,20,The Edge,6.7,349,1997,Sep,Saturday,Lee Tamahori,Male,43.31,0.0
314,44133,13,Anna and the King,6.4,197,1999,Dec,Thursday,Andy Tennant,Male,0.00,75.0
341,44181,13,Wolf,6.0,216,1994,Jun,Friday,Mike Nichols,Male,0.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1457,48323,5,The Mighty,7.1,51,1998,Oct,Friday,Peter Chelsom,Male,0.00,0.0
1459,48359,2,George Washington,6.4,36,2000,Oct,Sunday,David Gordon Green,Male,0.00,0.0
1460,48363,3,The Last Waltz,7.9,64,1978,May,Monday,Martin Scorsese,Male,0.32,0.0
1462,48375,7,Rampage,6.0,131,2009,Aug,Friday,Uwe Boll,Male,0.00,0.0


In [46]:
movies_mis_data['revenue_Mdollars'] = movies_mis_data['revenue_Mdollars'].replace(0, np.nan)

movies_mis_data['budget_Mdollars'] = movies_mis_data['budget_Mdollars'].replace(0, np.nan)

In [47]:
# Let's check the rows with null values before replace

movies_mis_data[movies_mis_data.isnull().any(axis=1)]

Unnamed: 0,id,popularity,title,vote_average,vote_count,year,month,day,director_name,gender,revenue_Mdollars,budget_Mdollars
17,43620,42,The Golden Compass,5.8,1303,2007,Dec,Tuesday,Chris Weitz,,372.23,180.00
38,43653,65,Star Trek Beyond,6.6,2568,2016,Jul,Thursday,Justin Lin,,343.47,185.00
50,43669,90,Suicide Squad,5.9,7458,2016,Aug,Tuesday,David Ayer,,745.00,175.00
53,43672,44,Waterworld,5.9,992,1995,Jul,Friday,Kevin Reynolds,,264.22,175.00
81,43729,44,Wrath of the Titans,5.5,1431,2012,Mar,Tuesday,Jonathan Liebesman,,301.00,150.00
...,...,...,...,...,...,...,...,...,...,...,...,...
1459,48359,2,George Washington,6.4,36,2000,Oct,Sunday,David Gordon Green,Male,,
1460,48363,3,The Last Waltz,7.9,64,1978,May,Monday,Martin Scorsese,Male,0.32,
1462,48375,7,Rampage,6.0,131,2009,Aug,Friday,Uwe Boll,Male,,
1463,48376,3,Slacker,6.4,77,1990,Jul,Friday,Richard Linklater,Male,,


- `any()` method returns one value for each column => `True` if ANY value in that column is `True`, otherwise `False`.

- Here `.any()` checks if any value in a column is `null`

<br><br>

**Now lets say you want to check for specific column say "revenue_Mdollars" :**

In [48]:
# Now check the null values only in revenue column

movies_mis_data.loc[movies_mis_data[["revenue_Mdollars"]].isna().any(axis=1)]

Unnamed: 0,id,popularity,title,vote_average,vote_count,year,month,day,director_name,gender,revenue_Mdollars,budget_Mdollars
271,44050,25,The Pink Panther,5.6,550,2006,Jan,Wednesday,Shawn Levy,Male,,
314,44133,13,Anna and the King,6.4,197,1999,Dec,Thursday,Andy Tennant,Male,,75.0
341,44181,13,Wolf,6.0,216,1994,Jun,Friday,Mike Nichols,Male,,
351,44198,9,Rollerball,3.4,106,2002,Feb,Friday,John McTiernan,,,
374,44245,19,Mona Lisa Smile,6.5,393,2003,Dec,Friday,Mike Newell,Male,,65.0
...,...,...,...,...,...,...,...,...,...,...,...,...
1453,48294,6,She's Gotta Have It,6.1,25,1986,Aug,Friday,Spike Lee,Male,,
1457,48323,5,The Mighty,7.1,51,1998,Oct,Friday,Peter Chelsom,Male,,
1459,48359,2,George Washington,6.4,36,2000,Oct,Sunday,David Gordon Green,Male,,
1462,48375,7,Rampage,6.0,131,2009,Aug,Friday,Uwe Boll,Male,,


<br>

__Scenario 7:__

#### Filling null values in revenue using mean => `replace(to_replace, value, inplace=True)`

Lets first calculate the mean of revenue column..


- **`to_replace`** - What you want to replace?


- **`value`** - With what value you want to replace?


- **`inplace=True`** - to do permanent change.

    <u>Note :</u>
    - **`value` - can be anything like :**
        - mean
        - median
        - mode
        - 0
        - min, max
        - ...

Similarly, you can try to fill the missing values for column budget 

In [50]:
m = movies_mis_data['revenue_Mdollars'].mean()

m

165.379866036249

In [52]:
movies_mis_data['revenue_Mdollars'].replace(to_replace=np.NaN, value=m, inplace=True)

movies_mis_data.tail()

Unnamed: 0,id,popularity,title,vote_average,vote_count,year,month,day,director_name,gender,revenue_Mdollars,budget_Mdollars
1460,48363,3,The Last Waltz,7.9,64,1978,May,Monday,Martin Scorsese,Male,0.32,
1461,48370,19,Clerks,7.4,755,1994,Sep,Tuesday,Kevin Smith,Male,3.15,0.03
1462,48375,7,Rampage,6.0,131,2009,Aug,Friday,Uwe Boll,Male,165.379866,
1463,48376,3,Slacker,6.4,77,1990,Jul,Friday,Richard Linklater,Male,165.379866,
1464,48395,14,El Mariachi,6.6,238,1992,Sep,Friday,Robert Rodriguez,,2.04,0.22


Now we can see in above results, the nan values at index `1462` and `1463` is replaced with mean value __165.37__.

<br>

__Scenario 8: Replace missing values with mode__

After checking for numerical values, now lets see how to handle missing values in categorical column `gender' :

In [53]:
movies_mis_data['gender'].value_counts(dropna=False)

Male      1309
NaN        124
Female      32
Name: gender, dtype: int64

In case of categorical we can replace the missing value by the most frequent value for that particular column,

__Here we replace the missing value by Male since the count of `Male` is more than `Female`__

In [54]:
#Take most occured category in that vairable (.mode())
Mode_Category = movies_mis_data['gender'].mode()

type(Mode_Category)

pandas.core.series.Series

#### Please note : `mode()` always returns Series even if only one value is returned.

So inorder to get first element of `mode()` we have to use `index[0]`.

In [56]:
Mode_Category = movies_mis_data['gender'].mode()[0]

Mode_Category

'Male'

Now replace `NAN` values with most occured category in actual variable

__NOTE: Features having a max number of null values may create bias if replace with the most occurred category.__

In [57]:
movies_mis_data['gender'].replace(to_replace=np.NaN, value=Mode_Category, inplace=True)

movies_mis_data['gender'].value_counts(dropna=False)

Male      1433
Female      32
Name: gender, dtype: int64

In [58]:
movies_mis_data

Unnamed: 0,id,popularity,title,vote_average,vote_count,year,month,day,director_name,gender,revenue_Mdollars,budget_Mdollars
0,43597,150,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male,2787.970000,237.00
1,43598,139,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male,961.000000,300.00
2,43599,107,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male,880.670000,245.00
3,43600,112,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male,1084.940000,250.00
4,43602,115,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male,890.870000,258.00
...,...,...,...,...,...,...,...,...,...,...,...,...
1460,48363,3,The Last Waltz,7.9,64,1978,May,Monday,Martin Scorsese,Male,0.320000,
1461,48370,19,Clerks,7.4,755,1994,Sep,Tuesday,Kevin Smith,Male,3.150000,0.03
1462,48375,7,Rampage,6.0,131,2009,Aug,Friday,Uwe Boll,Male,165.379866,
1463,48376,3,Slacker,6.4,77,1990,Jul,Friday,Richard Linklater,Male,165.379866,


#### Scenario 9: Finding risky movie, using `apply` function

Now, our goal is to find out movies which turn out to be risky.

This can be found via the logic that, if average revenue of the director is more than the movie's budget, then the movie can be said as "risky". To do so, we can groupby the dataframe by the director's name, and using `apply` function, find the required data.

Let's first define the function to calculate the difference.

#### Using a function :

In [32]:
movies = pd.read_csv("../datasets/movies_dataset/movies.csv", index_col=0)
directors = pd.read_csv("../datasets/movies_dataset/directors.csv", index_col=0)

In [33]:
movies.head()

Unnamed: 0,id,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day
0,43597,237000000,150,2787965087,Avatar,7.2,11800,4762,2009,Dec,Thursday
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,4763,2007,May,Saturday
2,43599,245000000,107,880674609,Spectre,6.3,4466,4764,2015,Oct,Monday
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,4765,2012,Jul,Monday
5,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,4767,2007,May,Tuesday


In [34]:
directors.head()

Unnamed: 0,director_name,id,gender
0,James Cameron,4762,Male
1,Gore Verbinski,4763,Male
2,Sam Mendes,4764,Male
3,Christopher Nolan,4765,Male
4,Andrew Stanton,4766,Male


<br>

Now let's merge our datasets and create a copy to build our final dataset

In [35]:
data = movies.merge(directors, how='left', left_on='director_id',right_on='id')

data.head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,director_id,year,month,day,director_name,id_y,gender
0,43597,237000000,150,2787965087,Avatar,7.2,11800,4762,2009,Dec,Thursday,James Cameron,4762,Male
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,4763,2007,May,Saturday,Gore Verbinski,4763,Male
2,43599,245000000,107,880674609,Spectre,6.3,4466,4764,2015,Oct,Monday,Sam Mendes,4764,Male
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,4765,2012,Jul,Monday,Christopher Nolan,4765,Male
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,4767,2007,May,Tuesday,Sam Raimi,4767,Male


<br>

__Drop multiple columns__ :
- pass all such column names in a list.
- Specify `axis=1` , because removing  column is indeed a horizontal operation.

In [36]:
data.drop(['director_id','id_y'], axis=1, inplace=True)

data.head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender
0,43597,237000000,150,2787965087,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male
2,43599,245000000,107,880674609,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male


In [37]:
df = data.copy(deep=True)

## just creating a copy for downstream use

In [38]:
def calculate_risky(x):
    return x["budget"] - x["revenue"].mean() > 0

Now, on using the groupby.

In [39]:
df.groupby("director_name").apply(calculate_risky)

director_name      
Adam McKay     176     False
               323     False
               366     False
               505     False
               839     False
                       ...  
Zhang Yimou    590     False
               604     False
               1217    False
               1223    False
               1389    False
Name: budget, Length: 1465, dtype: bool

<br>

This is fine, but as we see, the output is __multiindex__.

Now what issue can this create? - Say if we try to set this to a new column, __"risky"__

In [40]:
data["risky"] = data.groupby("director_name").apply(calculate_risky)

TypeError: incompatible index of inserted column with frame index

<br>

The error basically says, that since our output is multi-index, we can't assign it to a single column.

We need our output as a single series. To solve this issue, there are two methods.

#### Method 1: Creating the column and assigning it in the function itself

In [46]:
def calculate_risky(x):
    x["risky"] = x["budget"] - x["revenue"].mean() > 0
    return x


df = data.groupby("director_name").apply(calculate_risky)

df

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender,risky
0,43597,237000000,150,2787965087,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male,False
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male,False
2,43599,245000000,107,880674609,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male,False
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male,False
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1460,48363,0,3,321952,The Last Waltz,7.9,64,1978,May,Monday,Martin Scorsese,Male,False
1461,48370,27000,19,3151130,Clerks,7.4,755,1994,Sep,Tuesday,Kevin Smith,Male,False
1462,48375,0,7,0,Rampage,6.0,131,2009,Aug,Friday,Uwe Boll,Male,False
1463,48376,0,3,0,Slacker,6.4,77,1990,Jul,Friday,Richard Linklater,Male,False


<br>

#### Method 2: Using the parameter `group_keys = False`, while using groupby.

We know that `groupby`, groups the data based on lexicographical order. This could create a problem while assigning the values. Hence, we can use `group_keys = False` while using groupby, to return single index values.

Let's look at it's implementation :

In [47]:
def calc_risk_new(x):
    return x["budget"] - x["revenue"].mean() > 0


df["risky_new"] = df.groupby("director_name", group_keys = False).apply(calc_risk_new)

df

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender,risky,risky_new
0,43597,237000000,150,2787965087,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male,False,False
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male,False,False
2,43599,245000000,107,880674609,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male,False,False
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male,False,False
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1460,48363,0,3,321952,The Last Waltz,7.9,64,1978,May,Monday,Martin Scorsese,Male,False,False
1461,48370,27000,19,3151130,Clerks,7.4,755,1994,Sep,Tuesday,Kevin Smith,Male,False,False
1462,48375,0,7,0,Rampage,6.0,131,2009,Aug,Friday,Uwe Boll,Male,False,False
1463,48376,0,3,0,Slacker,6.4,77,1990,Jul,Friday,Richard Linklater,Male,False,False


<br>

This worked just fine. What does actually happen though? If we print the values of `calc_risk_new`, we can understand what it is returning :

In [48]:
df.groupby('director_name', group_keys = False).apply(calc_risk_new)

176     False
323     False
366     False
505     False
839     False
        ...  
590     False
604     False
1217    False
1223    False
1389    False
Name: budget, Length: 1465, dtype: bool

<br>

On using `group_keys = True`, we get the multi-index result, which is the default value.

In [49]:
df.groupby("director_name", group_keys = True).apply(calc_risk_new)

director_name      
Adam McKay     176     False
               323     False
               366     False
               505     False
               839     False
                       ...  
Zhang Yimou    590     False
               604     False
               1217    False
               1223    False
               1389    False
Name: budget, Length: 1465, dtype: bool

<br>

Hence, to create and assign these values to a column, we need to either assign the column in the function itself, or use the parameter `group_keys = True`

In [51]:
# Check if both the columns values are same

import numpy as np

np.all(df['risky'] == df['risky_new'])

True

<br>

#### Using Lambda

The same can be achieved using lambda function too, though in this case we will need to use `group_keys = False`, since we cannot assign a column to a dataframe in lambda single line function definition.

In [53]:
df.groupby('director_name', group_keys = True).apply(lambda x: x.budget - x['revenue'].mean()) > 0

director_name      
Adam McKay     176     False
               323     False
               366     False
               505     False
               839     False
                       ...  
Zhang Yimou    590     False
               604     False
               1217    False
               1223    False
               1389    False
Name: budget, Length: 1465, dtype: bool

<br>

This is exactly in line with the result we were getting earlier while using the functions.

If we try to print the indexes :

In [54]:
df.groupby("director_name").apply((lambda x: (x.budget - x["revenue"].mean() > 0).index))

director_name
Adam McKay                     Int64Index([176, 323, 366, 505, 839, 916], dty...
Adam Shankman                  Int64Index([265, 300, 350, 404, 458, 843, 999,...
Alejandro González Iñárritu    Int64Index([106, 749, 1015, 1034, 1077, 1405],...
Alex Proyas                    Int64Index([95, 159, 514, 671, 873], dtype='in...
Alexander Payne                Int64Index([793, 1006, 1101, 1211, 1281], dtyp...
                                                     ...                        
Wes Craven                     Int64Index([620, 651, 714, 734, 887, 932, 952,...
Wolfgang Petersen              Int64Index([65, 87, 132, 235, 515, 872, 1216],...
Woody Allen                    Int64Index([ 799,  895,  985, 1038, 1044, 1046...
Zack Snyder                    Int64Index([5, 10, 97, 187, 317, 396, 842], dt...
Zhang Yimou                    Int64Index([192, 590, 604, 1217, 1223, 1389], ...
Length: 199, dtype: object

This confirms our confusion about the multi indexes. Now, we can simply use `group_keys = False` to avoid this.

In [55]:
df.groupby('director_name', group_keys = False).apply(lambda x: x.budget - x['revenue'].mean()) > 0

176     False
323     False
366     False
505     False
839     False
        ...  
590     False
604     False
1217    False
1223    False
1389    False
Name: budget, Length: 1465, dtype: bool

In [56]:
df['lambda_risky'] = df.groupby('director_name', group_keys = False).apply(lambda x: x['budget'] - x['revenue'].mean()) > 0

df.head()

Unnamed: 0,id_x,budget,popularity,revenue,title,vote_average,vote_count,year,month,day,director_name,gender,risky,risky_new,lambda_risky
0,43597,237000000,150,2787965087,Avatar,7.2,11800,2009,Dec,Thursday,James Cameron,Male,False,False,False
1,43598,300000000,139,961000000,Pirates of the Caribbean: At World's End,6.9,4500,2007,May,Saturday,Gore Verbinski,Male,False,False,False
2,43599,245000000,107,880674609,Spectre,6.3,4466,2015,Oct,Monday,Sam Mendes,Male,False,False,False
3,43600,250000000,112,1084939099,The Dark Knight Rises,7.6,9106,2012,Jul,Monday,Christopher Nolan,Male,False,False,False
4,43602,258000000,115,890871626,Spider-Man 3,5.9,3576,2007,May,Tuesday,Sam Raimi,Male,False,False,False


In [57]:
np.all(df["risky"] == df["lambda_risky"])

True

__Key Takeaways :__

1. Using a function or lambda function, `with apply`, after using groupby on a dataframe, gives multiindex output.


2. We need to either "return the whole dataframe" (after creating a new column) OR do `group_keys = False`