In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra

import pandas as pd
# pandas defaults
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# 1. Reading the Data

In [None]:
imdb = pd.read_csv("../input/IMDB-Movie-Data.csv")

In [None]:
# top 5 rows
imdb.head()

In [None]:
#renaming some cols
imdb.rename(columns = {'Revenue (Millions)':'Rev_M','Runtime (Minutes)':'Runtime_min'},inplace=True)

# Creating a Column

You can create a new column in many ways.
If you want a column that is a sum or difference of columns, you can simply use basic arithmetic. Here I get the average rating based on the IMDB rating and the normalized Metascore (Metascore is on a 0-100 scale, so I divide it by 10).

In [None]:
imdb['AvgRating'] = (imdb['Rating'] + imdb['Metascore']/10)/2

But sometimes we may need to build complex logic around the creation of new columns.
To give you a convoluted example, let's say that we want to build a custom movie score based on a variety of factors.

Say, if the movie is of the thriller genre, I want to add 1 to the IMDB rating, subject to the condition that the rating remains less than or equal to 10. And if a movie is a comedy, I want to subtract 1 from the rating (without going below 0).

How do we do that?
Whenever I get a hold of such complex problems, I use apply/lambda. Let me first show you how I will do this.

In [None]:
def custom_rating(genre,rating):
    if 'Thriller' in genre:
        return min(10,rating+1)
    elif 'Comedy' in genre:
        return max(0,rating-1)
    else:
        return rating
        
imdb['CustomRating'] = imdb.apply(lambda x: custom_rating(x['Genre'],x['Rating']),axis=1)

The general structure is:
- You define a function that will take the column values you want to play with to come up with your logic. Here the only two columns we end up using are genre and rating.
- You use an apply function with lambda along the row with axis=1. The general syntax is:

```df.apply(lambda x: func(x['col1'],x['col2']),axis=1)```

You should be able to create pretty much any logic using apply/lambda since you just have to worry about the custom function.
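
To make the pattern easy to try without the Kaggle input file, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its values are assumptions for illustration only; the `custom_rating` function mirrors the one defined above.

```python
import pandas as pd

# Toy data standing in for the IMDB file (values are made up for illustration).
toy = pd.DataFrame({
    "Genre": ["Action,Thriller", "Comedy", "Drama", "Thriller"],
    "Rating": [9.5, 6.0, 7.2, 8.0],
})

def custom_rating(genre, rating):
    """Add 1 for thrillers (capped at 10), subtract 1 for comedies (floored at 0)."""
    if "Thriller" in genre:
        return min(10, rating + 1)
    elif "Comedy" in genre:
        return max(0, rating - 1)
    return rating

# With axis=1, x is a single row, so x["Genre"] and x["Rating"] are scalars.
toy["CustomRating"] = toy.apply(
    lambda x: custom_rating(x["Genre"], x["Rating"]), axis=1
)
print(toy)
```

The first row shows the cap at work: 9.5 + 1 would exceed 10, so the function returns 10.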

# Filtering a dataframe

Pandas makes filtering and subsetting dataframes pretty easy. You can filter and subset dataframes using the normal comparison operators together with the &, | and ~ operators.

In [None]:
# Single condition: dataframe with all movies rated greater than 8
imdb_gt_8 = imdb[imdb['Rating']>8]

imdb_gt_8.head()

In [None]:
# Multiple conditions: AND - dataframe with all movies rated greater than 8 and having more than 100000 votes

And_imdb = imdb[(imdb['Rating']>8) & (imdb['Votes']>100000)]

And_imdb.head()

In [None]:
# Multiple conditions: OR - dataframe with all movies rated greater than 8 or having a metascore more than 80

Or_imdb = imdb[(imdb['Rating']>8) | (imdb['Metascore']>80)]
Or_imdb.head()


In [None]:
# Multiple conditions: NOT - exclude all movies rated greater than 8 or having a metascore more than 80

Not_imdb = imdb[~((imdb['Rating']>8) | (imdb['Metascore']>80))]
Not_imdb.head()

Pretty simple stuff. 
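
One detail worth spelling out: each comparison is wrapped in parentheses because in Python `&` and `|` bind more tightly than `>`, so leaving them out changes how the expression is parsed. A minimal sketch on a made-up frame (the `toy` data and the intermediate mask names are assumptions for illustration) shows that each condition is just a boolean Series:

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [8.5, 7.0, 8.1, 6.5],
    "Votes": [250000, 500000, 50000, 10000],
})

# Each comparison produces one boolean per row.
high_rating = toy["Rating"] > 8
many_votes = toy["Votes"] > 100000

both = toy[high_rating & many_votes]        # AND: both conditions hold
either = toy[high_rating | many_votes]      # OR: at least one condition holds
neither = toy[~(high_rating | many_votes)]  # NOT: neither condition holds

print(both["Title"].tolist())
print(either["Title"].tolist())
print(neither["Title"].tolist())
```

Naming the masks first is optional, but it keeps longer filters readable.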

But sometimes we need filtering operations that we cannot express using just the above format.

For instance, let us say we want to keep only those rows where the number of words in the movie title is greater than or equal to 4.
How would you do it?

Trying the below will give you an error, because split is a string method and cannot be called directly on a Series.

In [None]:
# Raises AttributeError: 'Series' object has no attribute 'split'
new_imdb = imdb[len(imdb['Title'].split(" "))>=4]


One way is to first create a column which contains the number of words in the title using apply and then filter on that column.

In [None]:
#create a new column
imdb['num_words_title'] = imdb.apply(lambda x : len(x['Title'].split(" ")),axis=1)
#simple filter on new column
new_imdb = imdb[imdb['num_words_title']>=4]
new_imdb.head()

And that is a perfectly fine way as long as you don't have to create a lot of columns. But, I prefer this:

In [None]:
new_imdb = imdb[imdb.apply(lambda x : len(x['Title'].split(" "))>=4,axis=1)]
new_imdb.head()

What I did here is make my apply function return a boolean for each row, which can be used to filter.

Once you understand that you just need a Series of booleans to filter, you can use any function/logic in your apply statement to build however complex a filter you want.
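
Here is a minimal sketch of that idea on made-up titles (the `toy` frame is an assumption for illustration). It also shows, as an alternative that is not part of my usual workflow, the `.str` accessor, which gives the same mask for this particular case.

```python
import pandas as pd

# Toy titles, chosen only to illustrate the word count.
toy = pd.DataFrame({
    "Title": [
        "Guardians of the Galaxy",
        "Prometheus",
        "La La Land",
        "The Lost City of Z",
    ],
})

# apply returns one boolean per row...
mask = toy.apply(lambda x: len(x["Title"].split(" ")) >= 4, axis=1)
# ...and indexing with that boolean Series keeps only the True rows.
long_titles = toy[mask]

# Alternative: the .str accessor applies string methods element-wise.
mask_str = toy["Title"].str.split(" ").str.len() >= 4

print(long_titles)
```

The apply version generalizes to any Python function, while `.str` only covers string operations.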

Let us see another example. I will try to do something a little complex to just show the structure.

We want to find movies for which the revenue is less than the average revenue for that particular year.

In [None]:
# Mean revenue per year as a {year: mean_revenue} lookup
year_revenue_dict = imdb.groupby('Year')['Rev_M'].mean().to_dict()
def bool_provider(revenue, year):
    return revenue<year_revenue_dict[year]
    
new_imdb = imdb[imdb.apply(lambda x : bool_provider(x['Rev_M'],x['Year']),axis=1)]

new_imdb.head()

We have a function here which we can use to write any logic. 
That provides a lot of power for advanced filtering as long as we can play with simple variables.
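
The same structure can be checked by hand on a tiny made-up frame. In this sketch the `toy` data and the names `below_year_average` and `mask_vectorized` are assumptions for illustration; the `groupby(...).transform("mean")` line is an alternative that avoids the row-wise apply and should give the same mask.

```python
import pandas as pd

# Toy data, made up so the yearly means are easy to verify:
# 2015 -> (100 + 300) / 2 = 200, 2016 -> (50 + 70) / 2 = 60
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [2015, 2015, 2016, 2016],
    "Rev_M": [100.0, 300.0, 50.0, 70.0],
})

# Lookup table: year -> mean revenue for that year.
year_mean = toy.groupby("Year")["Rev_M"].mean().to_dict()

def below_year_average(revenue, year):
    """True when the revenue is below the mean revenue of its year."""
    return revenue < year_mean[year]

mask = toy.apply(lambda x: below_year_average(x["Rev_M"], x["Year"]), axis=1)
below_avg = toy[mask]

# Alternative without apply: broadcast each year's mean back onto its rows.
mask_vectorized = toy["Rev_M"] < toy.groupby("Year")["Rev_M"].transform("mean")

print(below_avg)
```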

# Change Column Types

I even use apply to change column types, since I don't want to remember the syntax for changing a column type and since it lets me do much more complex things.
The normal way to change a column type in Pandas is astype. So if I had a column named Price stored as str in my data, I could do this:

```df['Price'] = df['Price'].astype('int')```

But sometimes it won't work as expected.
You might get the error: ValueError: invalid literal for int() with base 10: '13,000'. That is, you cannot cast a string containing "," to an int. To do that, we first have to get rid of the comma.
After facing this problem time and again, I have stopped using astype altogether and just use apply to change column types.

```df['Price'] = df.apply(lambda x: int(x['Price'].replace(',', '')),axis=1)```
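
A minimal runnable sketch of this conversion, on made-up prices (the `raw` frame and the variable `alt` are assumptions for illustration). The last statement shows the `.str.replace` plus astype route for comparison, in case you prefer to keep astype.

```python
import pandas as pd

# Toy prices stored as strings with thousands separators (made up).
raw = pd.DataFrame({"Price": ["13,000", "250", "1,234,567"]})

df = raw.copy()
# Strip the commas row by row, then convert each value with int().
df["Price"] = df.apply(lambda x: int(x["Price"].replace(",", "")), axis=1)

# Alternative: remove the commas with the .str accessor, then cast the column.
alt = raw["Price"].str.replace(",", "", regex=False).astype(int)

print(df["Price"].tolist())
```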

# And lastly there is progress_apply

progress_apply is a function that comes with the tqdm package.

And this has saved me a lot of time.

Sometimes when you have got a lot of rows in your data, or you end up writing a pretty complex apply function, you will see that apply might take a lot of time.

I have seen apply taking hours when working with spaCy. In such cases, you might like to see a progress bar with apply.

You can use tqdm for that.

After the initial imports at the top of your notebook, just replace apply with progress_apply and everything remains the same.

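In [None]:
from tqdm.auto import tqdm  # picks the notebook or console progress bar automatically

# Register progress_apply on pandas objects
tqdm.pandas()

# Assign back to imdb so the new column lines up with the rows it was computed from
imdb['rating_custom'] = imdb.progress_apply(lambda x: custom_rating(x['Genre'],x['Rating']),axis=1)


In [None]:
imdb.head()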

# Conclusion

apply and lambda functionality lets you take care of a lot of complex things while manipulating data. 

I feel that I don't have to worry about a lot of stuff while using Pandas since I can use apply well. 

In this post, I tried to explain how it works. And there might be other ways to do whatever I have done above. 

But I like to stick with apply/lambda in place of map/applymap because I find it more readable and well suited to my workflow.
