# Tutorial 2: Data Processing with Pandas


###  By the end of this notebook, you will learn how to:

- Group and aggregate data to compute summary statistics across categories  
- Use `apply()` and `lambda` to perform custom operations on rows or columns 
- Clean and transform textual data using string methods and filtering techniques  
- Visualize patterns in data using built-in Pandas plotting functionality


## Loading the IMDB Movie Dataset

We will continue to work on the IMDB movie dataset from the previous tutorial. 


In [None]:
import pandas as pd

In [None]:
# Setting index_col=0 tells pandas to use the first column as the index:
movies_df = pd.read_csv("https://raw.githubusercontent.com/rnanda17/data_science_BE/refs/heads/main/IMDB-Movie-Data.csv", index_col=0)

movies_df.head(5)

## Aggregation

So far, we have been calculating statistics over an entire column. But what if we want to calculate statistics separately for each category, such as each director or each genre? For that, we can group the data and aggregate within each group.

We will learn about the `DataFrame.groupby` method. This method allows us to group rows based on a column and then calculate statistics for each group.

For example, if we want to calculate the average rating for each director, we can do the following:

In [None]:
movies_df.groupby("Director")["Rating"].mean()

We can analyse multiple columns at the same time by selecting a list of column names after `DataFrame.groupby` (note the double square brackets). For example, if we want to calculate the average rating and Metascore for each director, we can do the following:

In [None]:
movies_df.groupby("Director")[["Rating", "Metascore"]].mean()

There are many other aggregation methods that you can use with `DataFrame.groupby`. You can find a list of them here: https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html.
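As one illustration of those other aggregation methods, `agg()` can compute several statistics in a single call. The sketch below uses a small made-up DataFrame (the name `toy_df`, the output column names, and all values are invented for this example, not taken from the IMDB data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data, made up for this illustration only
toy_df = pd.DataFrame({
    "Director": ["A", "A", "B", "B", "B"],
    "Rating": [7.0, 8.0, 6.0, 6.5, 7.0],
    "Metascore": [70, 80, 55, 60, 65],
})

# agg() accepts a list of aggregation names and applies each one per group
rating_stats = toy_df.groupby("Director")["Rating"].agg(["mean", "min", "max"])
print(rating_stats)

# Named aggregation: pick a source column and an aggregation for each output column
summary = toy_df.groupby("Director").agg(
    avg_rating=("Rating", "mean"),
    best_metascore=("Metascore", "max"),
    n_movies=("Rating", "count"),
)
print(summary)
```

The same pattern should carry over to `movies_df` by swapping in its column names.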

## Sorting

To sort our DataFrame (or a subset of it) by a column, we can use the `DataFrame.sort_values` method. For instance, to sort movies by rating in descending order, we can do the following:

In [None]:
movies_df.sort_values("Rating", ascending=False)

### Task 1. Solve the exercises below

#### Exercise 1.1: What are the average ratings of movies by year?

Hint: Group the data by `Year` and calculate the **average `Rating`** for each year.

####  Exercise 1.2: Which Genres Are the Most Highly Rated?

Hint: Group the data by `Genre` and calculate the **average rating** for each genre.


#### 💰 Exercise 1.3: Which Directors bring the Most Revenue?


### Counting Unique Values

To count the number of unique values in a dataframe's column, we can use the `nunique()` method.

We can also use the `unique()` method to get a list of all the unique values in a column.

For instance, let's get the number of unique directors:

In [None]:
movies_df.Director.nunique()

In [None]:
# And if we want to get the list of unique directors:
print(movies_df.Director.unique())

### Counting Specific Values

What if we want to check the number of movies directed by a specific director? We can use the `value_counts()` method.

`value_counts()` counts how many times each value appears in a column. It returns a Series, with the values as the index and the counts as the values.

We can also get the value counts as proportions (fractions of the total) with the `normalize` parameter. Multiplying the result by 100 turns the proportions into percentages.

In [None]:
# Getting the number of movies directed by each director:
print(movies_df.Director.value_counts())

In [None]:
# Normalized value counts, multiplied by 100 to express them as percentages:
print((movies_df.Director.value_counts(normalize=True)*100))

In [None]:
# The percentages must add up to 100 (up to floating-point rounding). Let's check that:
print((movies_df.Director.value_counts(normalize=True)*100).sum())
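To answer the opening question about one specific director, the Series returned by `value_counts()` can be indexed by name. The sketch below is a minimal illustration on made-up data (`toy_df` and the director names are hypothetical).

```python
import pandas as pd

# Hypothetical toy data, made up for this illustration only
toy_df = pd.DataFrame({
    "Title": ["M1", "M2", "M3", "M4"],
    "Director": ["Director A", "Director B", "Director A", "Director A"],
})

director_counts = toy_df["Director"].value_counts()

# Look up one director by name in the resulting Series
n_by_a = director_counts["Director A"]
print(n_by_a)  # 3

# Equivalent approach: build a boolean mask and sum the True values
n_by_a_mask = (toy_df["Director"] == "Director A").sum()
print(n_by_a_mask)  # 3
```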

### Counting Values Within Groups

We can also count values within groups. The basic function for this is `count()`.

`count()` counts the number of non-null values in each group. It returns a Series, with the groups as the index and the counts as the values.

We can use the previous methods and apply them to groups (with the `groupby()` method).

For instance, let's count the number of movies per genre.

In [None]:
# Using count():
movies_df.groupby("Genre").Title.count()
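Because `count()` skips null values, its result depends on which column we count. The sketch below, on made-up data (`toy_df` and its values are hypothetical), contrasts `count()` with `size()`, which counts every row in each group, missing values included.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing revenue values
toy_df = pd.DataFrame({
    "Genre": ["Action", "Action", "Drama", "Drama", "Drama"],
    "Revenue": [100.0, np.nan, 20.0, np.nan, np.nan],
})

# count() ignores NaN, so only rows with a known revenue are counted
non_null_counts = toy_df.groupby("Genre")["Revenue"].count()
print(non_null_counts)

# size() counts all rows in each group, including those with NaN
group_sizes = toy_df.groupby("Genre").size()
print(group_sizes)
```

When counting movies per group, it is therefore safer to count a column that has no missing values, or to use `size()`.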

Now combine the different count methods to answer the following question: how many unique directors are there per genre?

In [None]:
# Number of unique directors per genre:
display(movies_df.groupby(by="Genre")["Director"].nunique())

## Processing Text

If you examine the `movies_df` DataFrame carefully, you can see that the `Actors` column stores the actors of each movie as a single string, separated by commas.

We can split the string into a list of actors so we can do some data processing operations on individual actors.

### Splitting Strings

We can do this with the `str.split()` method.

`str` (short for string) is a special accessor available on Pandas Series (and Index objects) that hold text. It provides many useful methods for processing text.

Check out the documentation for more information: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#string-handling

Before we use `str.split()`, note that the separators in the `Actors` column of `movies_df` are inconsistent: some actors are separated by `,` and some by `, ` (a comma followed by a space). If we split on `,` without fixing this, some actor names would keep a leading space, so the same actor could appear under two different spellings. We will therefore first use the `str.replace()` method, which replaces one substring with another, to turn every comma followed by a space into a plain comma.

In [None]:
# Replace "comma + space" with a plain comma so actor names split consistently
movies_df["Actors"] = movies_df.Actors.str.replace(", ", ",", regex=False)


In [None]:
# str.split() takes a separator as a parameter. By default, it splits on whitespace.
# In our case, we want to split on commas.
movies_df.Actors.str.split(",")

So `str.split()` turned the string into a list of strings. But how do we add each actor to the DataFrame?

The easiest option is to add each actor as a new row.

We can do this with the `explode()` method.

`explode()` takes a column with lists and turns each element of the list into a new row.

In [None]:
movies_df["Actors"] = movies_df.Actors.str.split(",")
# We create a new df with the "exploded" column
movies_df_actors = movies_df.explode("Actors").copy()

Let's have a look at the result:

In [None]:
# movies_df_actors is already exploded, so we can display it directly
movies_df_actors

Now we have each actor as a separate row. `explode()` copies the other columns for each new row. Now the length of the DataFrame is the sum of the lengths of the lists in the `Actors` column.
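The replace, split, and explode steps above can be sketched end to end on a tiny example. The data here is invented for illustration (`toy_df`, the titles, and the actor names are hypothetical), which makes it easy to verify the row count by hand.

```python
import pandas as pd

# Hypothetical toy data with inconsistent separators ("," and ", ")
toy_df = pd.DataFrame({
    "Title": ["M1", "M2"],
    "Actors": ["Ann Lee, Bob Ray,Cy Fox", "Ann Lee,Dee Kay"],
})

# Normalize the separator, then split each string into a list of names
toy_df["Actors"] = (
    toy_df["Actors"].str.replace(", ", ",", regex=False).str.split(",")
)

# explode() creates one row per list element and repeats the other columns
toy_actors = toy_df.explode("Actors")
print(toy_actors)

# The exploded length equals the total number of list elements (3 + 2 = 5)
total_list_items = toy_df["Actors"].str.len().sum()
print(len(toy_actors), total_list_items)
```

Note that `explode()` repeats the original index labels; `reset_index(drop=True)` can be chained on if a fresh index is preferred.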

Let's check a few things:

- Number of unique actors
- Number of movies per actor
- Number of movies per actor per year

In [None]:
# Number of unique actors
movies_df_actors.Actors.nunique()

In [None]:
# Number of movies per actor
movies_df_actors.groupby("Actors")["Title"].nunique()

In [None]:
# Number of movies per actor per year
movies_df_actors.groupby(["Actors", "Year"])["Title"].nunique()

## Aggregating

Now that we have the actors as separate rows, we can aggregate their data to get some interesting insights.

We have already learned the `groupby()` method, which is the basic tool for aggregating data. Let's review it by finding out which directors worked with the highest number of different actors.

In [None]:
movies_df_actors.groupby("Director").Actors.nunique().sort_values(ascending=False).head(5)

### `sum()`

`sum()` adds up the values in each group. On a boolean column, it counts the number of `True` values, because `True` is treated as 1 and `False` as 0.

The other basic statistical functions are `mean()`, `median()`, `min()`, `max()`, and `std()`. They all work the same way.

Let's check the mean revenue of each actor.

In [None]:
movies_df_actors.groupby("Actors")["Revenue (Millions)"].mean().sort_values(ascending=False)
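The use of `sum()` to count `True` values can be sketched as follows. The data, the `HighlyRated` column name, and the threshold of 8 are all made up for this illustration.

```python
import pandas as pd

# Hypothetical toy data, made up for this illustration only
toy_df = pd.DataFrame({
    "Director": ["A", "A", "B", "B", "B"],
    "Rating": [7.0, 8.5, 6.0, 8.2, 9.0],
})

# Boolean column: True when the rating is above 8 (an arbitrary threshold)
toy_df["HighlyRated"] = toy_df["Rating"] > 8

# True counts as 1 and False as 0, so sum() counts the True values per group
highly_rated_per_director = toy_df.groupby("Director")["HighlyRated"].sum()
print(highly_rated_per_director)
```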

#### `apply()` and lambda functions with Pandas

`apply()` can be used together with a Python lambda function to apply a custom operation to every column of a DataFrame (or to every row, with `axis=1`). A lambda function is a small anonymous function that can take any number of arguments but evaluates a single expression. We covered lambda functions in the Pre-Tutorial Guide.

Let's see a simple example first. We create a simple dataframe with three rows and three columns. 

In [None]:
data = [(3,5,7), (2,4,6),(5,8,9)]
df = pd.DataFrame(data, columns = ['A','B','C'])
df

Now let's use `apply()` with a `lambda` function to add `10` to every element of the DataFrame. We store the result in a new DataFrame, `df2`.

In [None]:
# Apply a lambda function to each column
df2 = df.apply(lambda x : x + 10)
df2
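`apply()` can also work row by row, or on the individual values of a single column. The sketch below reuses the same small data under a separate, hypothetical name (`df_demo`) so that it does not interfere with the cells above.

```python
import pandas as pd

data = [(3, 5, 7), (2, 4, 6), (5, 8, 9)]
df_demo = pd.DataFrame(data, columns=["A", "B", "C"])

# With axis=1, each row is passed to the lambda as a Series
row_totals = df_demo.apply(lambda row: row["A"] + row["B"] + row["C"], axis=1)
print(row_totals.tolist())  # [15, 12, 22]

# Series.apply passes each single value of the column to the lambda
a_parity = df_demo["A"].apply(lambda value: "even" if value % 2 == 0 else "odd")
print(a_parity.tolist())  # ['odd', 'even', 'odd']
```

For simple arithmetic such as the row totals, a vectorized expression like `df_demo.sum(axis=1)` gives the same result and is usually faster; `apply()` is most useful when the logic cannot be expressed that way.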

## 🎬 Task 2 Exercise: Classify Movies by Length Using `apply()` and `lambda`

In this exercise, you'll create a new column called **`Length Category`** in the `movies_df` DataFrame that classifies movies into three categories based on their runtime:

- `"Short"` if runtime is **less than 90 minutes**
- `"Medium"` if runtime is **between 90 and 120 minutes (inclusive)**
- `"Long"` if runtime is **more than 120 minutes**

---

### 💡 Hints:

- Use the column `Runtime (Minutes)`.
- Define a function with if-elif-else to return the category.
- Use `.apply()` with a `lambda` to apply your function to each value in the column.

---

### ✅ Sample Output:
(Note that the other columns of `movies_df` are still present; the table below shows only the title, runtime, and length category.)

| Title            | Runtime (Minutes) | Length Category |
|------------------|------------------|-----------------|
| Split            | 117              | Medium          |
| Suicide Squad    | 123              | Long            |
| Sing             | 108              | Medium          |

---


## Data Visualization

Pandas has a built-in visualization system. It is based on Matplotlib.

We will learn more about the different types of plots and how to customize them later, but for now, let's just have a quick look at some basic plots.

#### Scatter Plots
- Used to show the relationship between **two numeric variables**.
- Each point represents one observation.
- Great for spotting **correlations**, **clusters**, or **outliers**.

#### Line Plots
- Used to show **trends over time** or across ordered categories.
- Often used with **time series** data to see how a value changes (e.g., yearly average).
- Points are connected with lines to highlight the trend.

#### Bar Plots
- Used to compare **values across categories** (e.g., number of movies per genre).
- Bars can be vertical or horizontal.
- Good for seeing **which categories are bigger or smaller**.

#### Histograms
- Used to show the **distribution** of a single numeric variable.
- Helps you see the **shape** of the data (e.g., is it skewed? where is it concentrated?).
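The four plot types described above can be sketched side by side on a small made-up dataset. Everything here is an assumption for illustration: `toy_df`, its values, and the panel titles are hypothetical, and the `ax` argument is used only to place each plot in its own panel of a Matplotlib figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, made up for this illustration only
toy_df = pd.DataFrame({
    "Year": [2014, 2014, 2015, 2015, 2016, 2016],
    "Rating": [6.1, 7.3, 6.8, 8.0, 7.1, 7.9],
    "Revenue": [50.0, 120.0, 80.0, 300.0, 95.0, 210.0],
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Scatter: relationship between two numeric variables
toy_df.plot(kind="scatter", x="Rating", y="Revenue", ax=axes[0, 0], title="Scatter")

# Line: a value (average rating) across ordered years
toy_df.groupby("Year")["Rating"].mean().plot(kind="line", ax=axes[0, 1], title="Line")

# Bar: number of rows per category (year)
toy_df["Year"].value_counts().sort_index().plot(kind="bar", ax=axes[1, 0], title="Bar")

# Histogram: distribution of a single numeric variable
toy_df["Rating"].plot(kind="hist", bins=5, ax=axes[1, 1], title="Histogram")

fig.tight_layout()
```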


### Scatter Plots

We can create a scatter plot with `DataFrame.plot(kind="scatter")`, or equivalently with `DataFrame.plot.scatter()`.

What is the relationship between the revenue and the rating of the movies? Do highly rated movies earn more revenue?

In [None]:
# Use movies_df (one row per movie) so that each movie is plotted only once
movies_df.plot(kind="scatter", x="Rating", y="Revenue (Millions)", title="Revenue by Rating")

### Line Plots

How does the average Metascore change over time? We can use a line plot to visualize this.

In [None]:


# Group the data by 'Year' (movies_df has one row per movie, so each movie counts once)
grouped_by_year = movies_df.groupby("Year")

# Calculate the average Metascore per year
average_metascore = grouped_by_year.Metascore.mean()

# Plot the average Metascore as a line plot
average_metascore.plot(kind="line", title="Metascore by Year", figsize=(10, 5))  # figsize sets the size of the plot


### Bar Plots

Let's use a bar plot to visualize the number of movies per year.

In [None]:
# Group the data by 'Year'
grouped_by_year = movies_df_actors.groupby("Year")

# Count the number of unique movie titles per year
movies_per_year = grouped_by_year.Title.nunique()

# Plot the result as a bar chart (the Series index, Year, is used for the x-axis)
movies_per_year.plot(kind="bar", title="Number of Movies per Year", figsize=(10, 5))


### Histograms

We can use a histogram to visualize the distribution of a column.

Let's check the distribution of Ratings. What are your key observations from the histogram?

In [None]:
# Use movies_df (one row per movie) so that each rating is counted only once
movies_df.Rating.hist(bins=10, figsize=(10, 5))

## Task 3

#### 3.1 Who are the highest-rated actors?

#### 3.2 Which director-actor combination generates the highest revenue?

#### 3.3 Do actors whose names start with a J tend to have higher ratings?

#### 3.4 Are movies getting longer over time? Use an appropriate plot to investigate.

#### 3.5 Do movies with more votes get higher ratings?
