In [None]:
import seaborn as sns
import pandas as pd

In [None]:
movies = pd.read_csv('top_movies_2017.csv')
movies

**Create a new column showing each movie’s age in 2025.**

In [None]:
movies['Age'] = 2025 - movies['Year']
movies
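
The subtraction above is vectorized: pandas applies `2025 - ...` to every row of the `Year` column at once. A minimal sketch on a hypothetical three-row table (the release years are real, but the table `toy` is made up and is not the course dataset):

```python
import pandas as pd

# Hypothetical toy table standing in for top_movies_2017.csv
toy = pd.DataFrame({
    'Title': ['Gone with the Wind', 'Titanic', 'Avatar'],
    'Year': [1939, 1997, 2009],
})

# The scalar 2025 is broadcast against the whole column
toy['Age'] = 2025 - toy['Year']
print(toy)
```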

# Bar plot

### Show the relationship between a **categorical attribute** and a **numerical attribute**.

**Try a bar plot between a categorical variable (`Title`) and a numerical variable (`Age`)**

In [None]:
sns.barplot(data=movies, 
            x='Title', 
            y='Age')

# what's wrong with this??

The chart is almost unreadable because there are **too many categories** (200 movie titles).  
This shows a common issue: not every dataset can be visualized in full.

A quick fix is to **subset the data**. Here we take the first 10 movies (`.head(10)`).  

In [None]:
ten_movies = movies.head(10)

sns.barplot(data=ten_movies, 
            y='Title', 
            x='Age')

Now the chart is **readable**, but the choice of movies is arbitrary.

Instead of simply taking the first 10 rows, we can choose a subset based on a criterion.  
Example: the **10 oldest movies** in the dataset.

### Task: Find the **Age** of the **top-10 oldest** movies.

In [None]:
# more thought into picking which 10 movies to show

ten_oldest_movies = movies.sort_values('Age', ascending=False).head(10)

ten_oldest_movies

In [None]:
sns.barplot(data=ten_oldest_movies, 
            y='Title', 
            x='Age')

In [None]:
# sns.barplot(data=ten_oldest_movies, 
#             y='Title', 
#             x='Age',
#            color='tab:blue') # you can do this if you don't like rainbows 

# # https://matplotlib.org/stable/gallery/color/named_colors.html#base-colors

### Task: Find the **Age** of the **top-10 grossing** movies.

You may want to sort the dataset by the `Gross` column, using `.sort_values('Gross')`.

In [None]:
movies.sort_values('Gross').head(10)

Remember: `sort_values` sorts in ascending order by default (smallest → largest), so the cell above shows the 10 *lowest*-grossing movies in the dataset. To find the top-10 grossing movies, you need `.sort_values('Gross', ascending=False)`.
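
To see the sort direction concretely, here is a small sketch on a hypothetical table (titles and numbers are made up). It also shows `DataFrame.nlargest`, a pandas shortcut that this notebook does not otherwise use; it returns the top rows in one call.

```python
import pandas as pd

# Hypothetical data, not the course dataset
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Gross': [300, 100, 400, 200],
})

lowest_first = toy.sort_values('Gross')                    # ascending is the default
highest_first = toy.sort_values('Gross', ascending=False)  # largest values first
top_two = toy.nlargest(2, 'Gross')                         # same rows as highest_first.head(2)

print(lowest_first['Title'].tolist())
print(highest_first['Title'].tolist())
print(top_two['Title'].tolist())
```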

In [None]:
highest_grossing_movies = movies.sort_values('Gross',ascending=False)
highest_grossing_movies.head(10)

In [None]:
sns.barplot(data=highest_grossing_movies.head(10), 
            x='Age', 
            y='Title')

From the figure above we can say: many of the highest-grossing movies are very young (less than 20 years old). Exceptions include *Titanic* (1997) and *Star Wars: Episode I* (1999), which are older but still remain in the top 10.

But remember, when comparing money across time, consider inflation. Let's find the **Age of top-10 grossing (adjusted) movies**.

In [None]:
highest_adjgrossing_movies = movies.sort_values('Gross (Adjusted)',ascending=False)
sns.barplot(data=highest_adjgrossing_movies.head(10), 
            x='Age', 
            y='Title')

Once we adjust for inflation, **older classics rise to the top**:  *Gone with the Wind*, *The Sound of Music*, *Snow White*, etc.  

### Categorical distribution (you can try `sns.countplot`)

If you want to show the **count** of each category, you don’t need to calculate it yourself and then pass it to `sns.barplot`.  

👉 Use `sns.countplot`, which counts and plots in one step!

**Task**: Count how many movies each **Studio** has. This shows the distribution of a categorical variable (called categorical distribution).

If you insist on using `sns.barplot`, you would need to count the categories first:

In [None]:
movies.Studio.value_counts()

In [None]:
sns.barplot(data=movies.Studio.value_counts())

Note: Why does `sns.barplot(data=movies.Studio.value_counts())` work?  

- `value_counts()` returns a **pandas Series** with:  
  - **Index** = category (Studio names)  
  - **Values** = counts  
  - **Name** = "count" (the default in pandas 2.0 and later)  

- Seaborn can automatically use:  
  - **Index** → x-axis  
  - **Values** → y-axis  
  - **Name** → axis label  

⚠️ But this only works nicely because the Series has both **index** and **name**.  
👉 Safer (and clearer) to pass a DataFrame into `sns.barplot`.

In [None]:
studio_counts = movies.Studio.value_counts().reset_index() # value_counts() gives a Series; reset_index() converts it into a DataFrame
studio_counts.columns = ['Studio', 'Count']  # Rename the two columns: first = Studio name, second = Count

sns.barplot(data=studio_counts, y='Studio', x='Count')
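
The default name of the counts column depends on the pandas version, so it can help to name both columns explicitly instead of relying on position. A sketch of one way to do that, using a hypothetical list of studios (the `studios` Series is made up for illustration):

```python
import pandas as pd

# Hypothetical studio column
studios = pd.Series(['Fox', 'Disney', 'Fox', 'MGM', 'Fox', 'Disney'], name='Studio')

studio_counts = (
    studios.value_counts()          # Series: index = studio, values = counts
           .rename_axis('Studio')   # name the index so it becomes the 'Studio' column
           .reset_index(name='Count')  # name the values column 'Count'
)
print(studio_counts)
```

The resulting DataFrame can be passed to `sns.barplot(data=studio_counts, y='Studio', x='Count')` exactly as in the cell above.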

*Simpler way*: `sns.countplot` automatically counts categories, no need for `value_counts`!

In [None]:
sns.countplot(data=movies, 
              x='Studio') # Vertical bars

In [None]:
sns.countplot(data=movies, 
              y='Studio') # Horizontal bars

By default, categories appear in the order they show up in the dataset. This order is usually not meaningful.  

We can sort them by frequency by passing `order=movies.Studio.value_counts().index`. The general pattern is `order=DataFrame.Column.value_counts().index`.

In [None]:
sns.countplot(data=movies, 
              y='Studio', 
              order=movies.Studio.value_counts().index)

You can also reverse the order with:  `movies.Studio.value_counts(ascending=True).index`

In [None]:
sns.countplot(data=movies, 
              y='Studio', 
              order=movies.Studio.value_counts(ascending=True).index,
             color='tab:blue')
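
The `order` argument simply receives a list-like of category labels. A sketch with a hypothetical studio column shows what the two `value_counts` calls hand over:

```python
import pandas as pd

# Hypothetical studio column
studios = pd.Series(['Fox', 'Disney', 'Fox', 'MGM', 'Fox', 'Disney'])

# value_counts sorts by frequency, so its index is already in plotting order
most_common_first = studios.value_counts().index.tolist()
least_common_first = studios.value_counts(ascending=True).index.tolist()

print(most_common_first)
print(least_common_first)
```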

# Histogram: Dealing with Numerical distribution

When we want to study the distribution of a **numerical variable** (like `Age`),  we use a histogram by calling `sns.histplot`. 

This groups values into bins and shows how many data points fall in each bin.

In [None]:
sns.histplot(data=movies, x='Age')

This shows the distribution of movie ages.  
Most movies in the dataset are between **10–40 years old**, but some are much older.  
Default bins may hide details, so we might want to explore different bin sizes.

**Step: Quick summary statistics**: To better understand the numerical column, we can use `.describe()` to check the **count, mean, min, max, and quartiles**.

In [None]:
movies['Age'].describe()

This tells us:  
- There are 200 movies.  
- The average age is ~37 years.  
- The oldest movie is 104 years old.  

This helps us choose reasonable bin ranges for the histogram.

**Step: Custom bins**: Instead of letting seaborn decide bin sizes, we can **control the bin edges**.

For example, bins of size 10 years (`range(0,110,10)`).

In [None]:
sns.histplot(data=movies, x='Age', bins=range(0, 110, 10))  # range stops before 110, so the last bin edge is 100

Now the histogram is grouped by decades (0–10, 10–20, …). This makes it easier to see how many movies belong to each decade range.

But wait, the maximum is 104. Where is it in the histogram?

Remember: `range(start, stop, step)` goes **up to but not including** the stop value, so `range(0,110,10)` ends at 100 and movies older than 100 are left out. Ideally we should set `range(0,111,10)`, whose last edge is 110.
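
The effect of the stop value can be checked without plotting. The sketch below uses `np.histogram` with a few hypothetical ages. It assumes seaborn treats explicit bin edges the same way NumPy does, so read it as an illustration rather than a description of seaborn internals.

```python
import numpy as np

ages = [8, 28, 86, 104]  # hypothetical ages; 104 matches the maximum above

short_edges = list(range(0, 110, 10))  # 0, 10, ..., 100
full_edges = list(range(0, 111, 10))   # 0, 10, ..., 110

short_counts, _ = np.histogram(ages, bins=short_edges)
full_counts, _ = np.histogram(ages, bins=full_edges)

# Values outside the outermost edges are silently ignored
print(short_edges[-1], short_counts.sum())
print(full_edges[-1], full_counts.sum())
```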

In [None]:
sns.histplot(data=movies, x='Age', bins = range(0,111,10))

Now the last bin correctly includes movies up to age 104.  

⚠️ Always check whether your bin range covers the **maximum value** in the data!

**Custom bin edges**: We can also define bins manually, not just with `range()`.  
Here we create bins that group movies into age categories like 0–5, 5–10, 10–15, 15–25, etc.

In [None]:
my_bins = [0, 5, 10, 15, 25, 40, 65, 104]  # each bin excludes its right edge, except the last bin, which includes 104
sns.histplot(data=movies, x='Age', bins = my_bins)

This histogram is less uniform, but may be **more meaningful depending on the context**.  
Custom bins allow us to tell a story that fits the question we want to answer.
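
One detail about explicit bin edges: in NumPy's convention every bin includes its left edge and excludes its right edge, except the last bin, which includes both. The sketch below checks this with `np.histogram` on hypothetical ages placed exactly on the edges. It assumes `sns.histplot` follows the same convention when given explicit edges.

```python
import numpy as np

my_bins = [0, 5, 10, 15, 25, 40, 65, 104]
ages = [5, 65, 104]  # hypothetical values sitting exactly on bin edges

counts, edges = np.histogram(ages, bins=my_bins)

# Pair each count with its bin for readability
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f'{left}-{right}: {n}')
```

Age 5 lands in the 5–10 bin rather than 0–5, while both 65 and 104 land in the final 65–104 bin.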

From the histogram above we see the raw **counts** of movies in each age group.  
About 60 movies are between 40 and 65 years old, which makes this the largest group.  
This view answers the question: *"How many movies are in each bin?"*

⚠️ **Note:** By default, `sns.histplot` uses `stat="count"`, so bars show counts, not densities.    

**Percent vs Count**: By default, histograms show **counts**.  

Sometimes it is more intuitive to look at the **percentage of total movies** in each bin.  

We can use `stat='percent'`.

In [None]:
sns.histplot(data=movies, x='Age',stat='percent',bins=my_bins)

Now the y-axis is in **percent** instead of raw counts, but **the shape is the same as in the count histogram above**.

This is better for comparing across datasets of different sizes, since the total always sums to 100%.
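
Percent is just count rescaled: each bar's height is its count divided by the total, times 100. A sketch with `np.histogram` and hypothetical ages (the `ages` list is made up):

```python
import numpy as np

ages = [8, 12, 16, 28, 30, 45, 50, 60, 86, 104]  # hypothetical
my_bins = [0, 5, 10, 15, 25, 40, 65, 104]

counts, _ = np.histogram(ages, bins=my_bins)
percent = counts / counts.sum() * 100  # share of all observations, in percent

print(counts)
print(percent)
```

Because every bar is divided by the same total, the relative heights do not change.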

**Density vs Count**: Another option is `stat='density'`.  

This scales the histogram so that the **total area = 1**, which is useful when comparing to probability distributions.

In [None]:
sns.histplot(data=movies, x='Age',stat='density',bins=my_bins)

Now the histogram is scaled so that the **total area = 1**, and it can be interpreted as an **approximation of a probability distribution**.  
This is often used when we want to compare data with a **theoretical model** (e.g., Normal distribution).

The highest density is around 20–30: this is where movies are most concentrated *per year of age*, which with unequal bins is not necessarily the bin that holds the most movies.  
This view highlights the **shape of the distribution**, useful when comparing with probability models.
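
Density divides each count by the total number of observations and by the bin width, so that the bar areas (height × width) sum to 1. A sketch with hypothetical ages, using NumPy as a stand-in for the binning step:

```python
import numpy as np

ages = [8, 12, 16, 28, 30, 45, 50, 60, 86, 104]  # hypothetical
my_bins = [0, 5, 10, 15, 25, 40, 65, 104]

counts, edges = np.histogram(ages, bins=my_bins)
widths = np.diff(edges)                      # width of each bin

density = counts / (counts.sum() * widths)   # count / (n * width)
areas = density * widths                     # fraction of observations per bin

print(density.round(4))
print(areas.sum())
```

In this toy data the 40–65 bin has the largest count, yet the narrow 5–10 and 10–15 bins have the largest density, which is why count and density histograms can look different when bins are unequal.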

**Summary: Count vs Percent vs Density**

- **Count (default)** → Shows raw number of observations in each bin.  
  👉 Good when the dataset size is fixed and absolute numbers matter.

- **Percent (`stat='percent'`)** → Shows proportion of observations in each bin. (Same shape as Count)   
  👉 Good for comparing datasets of different sizes, since everything scales to 100%.

- **Density (`stat='density'`)** → Scales so total area = 1.  
  👉 Good when comparing to probability distributions or doing statistical modeling.

⚠️ Choosing the right option depends on the **story you want to tell** with the data.

**Let’s switch to a different dataset: incomes of top female actors in 2022.**

In [None]:
incomes = pd.read_csv('2022_female_actors.csv')
incomes.head()

This dataset has two columns: `Name` and `Income (millions)`. 

Quickly summarize the income column using `.describe()`.

In [None]:
incomes.describe()

We learn that:  
- Average income ≈ **26.5M**.  
- Range: from **12.5M** to **56M**.  
- Middle 50% fall between **21M–30M**.  

This helps us know what values to expect in the histogram.

Let’s first plot incomes using 10 equal-width bins.

In [None]:
sns.histplot(data=incomes, x='Income (millions)', bins=10)  # name the column explicitly

Most actors earn in the **22–26M** range. However, the shape depends heavily on how bins are chosen.

Now let’s force the bins to be every 6 million.  
This gives us finer control over the histogram.

In [None]:
sns.histplot(data=incomes, x='Income (millions)', bins=range(0, 61, 6))  # very different from above!

We see clustering around **18–36M**.  
⚠️ This shows how **bin width (and where the bins start) affects interpretation**.
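
The same values can look quite different under different bins. A sketch with hypothetical incomes (in millions), counted once with 10 equal-width bins over the data range and once with 6-million-wide bins starting at 0:

```python
import numpy as np

incomes = [12.5, 18, 20, 22, 24, 26, 27, 29, 35, 56]  # hypothetical, in millions

# bins=10 splits the range from min to max into 10 equal parts
auto_counts, auto_edges = np.histogram(incomes, bins=10)

# explicit edges 0, 6, ..., 60
fixed_counts, fixed_edges = np.histogram(incomes, bins=list(range(0, 61, 6)))

print(auto_edges.round(2), auto_counts)
print(fixed_edges, fixed_counts)
```

Both versions use 10 bins, but the edges fall in different places, so the same incomes are grouped differently.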

Instead of evenly spaced bins, we can **manually define income groups**.  
Here we split incomes into three categories: 0–25M, 25–30M, and 30–60M.

In [None]:
my_bins = [0, 25, 30, 60]
sns.histplot(data=incomes, x='Income (millions)', bins=my_bins)

Now the histogram has very uneven bin widths, and the y-axis still shows counts. We must be careful: unequal bins can make interpretation tricky.

We usually combine manual bins with `stat='density'` to interpret the histogram as a probability distribution.

In [None]:
sns.histplot(data=incomes, x='Income (millions)', bins=my_bins, stat='density')

Here the total area = 1, and the tallest bar (~0.04) means that  
the **25–30M range** has the highest probability density.  
This view emphasizes the **shape of the distribution** rather than raw counts.

**Comparing Count vs Density (with unequal bins)**

- **Count:**  
  The y-axis shows the *number of actors* in each income bin.  
  👉 Most actors fall in the **0–25M** range, so this bin is tallest in the count view.

- **Density:**  
  The y-axis shows *probability density*, scaled so that the total area = 1.  
  👉 Even though the 25–30M bin is narrow, it still contains several actors.  
  This makes its **density the highest (~0.04)**, meaning actors are most concentrated per million dollars of income in this range. The share of actors in a bin is its **area** (height × width): roughly 0.04 × 5 = 0.20 for this bin.

  Dr. Jia's suggestion/preference: **for unequal bins, use Density**. 
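
The count-versus-density contrast can be reproduced on a small scale. The sketch below uses hypothetical incomes (not the values in `2022_female_actors.csv`) and `np.histogram` in place of the plotting call:

```python
import numpy as np

incomes = [12.5, 18, 20, 22, 24, 26, 27, 29, 35, 56]  # hypothetical, in millions
my_bins = [0, 25, 30, 60]

counts, edges = np.histogram(incomes, bins=my_bins)
density, _ = np.histogram(incomes, bins=my_bins, density=True)
share = density * np.diff(edges)  # bar area = fraction of actors in the bin

print(counts)   # the 0-25 bin has the most actors
print(density)  # the narrow 25-30 bin has the highest density
print(share)    # the areas sum to 1
```

The 0–25 bin holds half of the actors but is spread over 25 million, so its density is low. The 25–30 bin holds fewer actors but packs them into 5 million, so its bar is the tallest in the density view.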