<div style="background-color: #6B46C1; color: white; padding: 20px; border-radius: 10px; margin: 10px 0;">

# 🎬 Golden-Age-TV-Data-Analysis: Project: Data Analysis with Pandas

In this project, you will work with a dataset containing information about movies and shows. You will use the pandas library to read, clean, and analyze the data. This project will help you practice working with DataFrames, indexing, and data manipulation using pandas.

## 📊 Dataset Description

The dataset `movies_and_shows.csv` contains information about various movies and shows, including:

- **name**: The name of the actor or actress
- **Character**: The character they played
- **r0le**: The role type (e.g., ACTOR)
- **TITLE**: The title of the movie or show
- **Type**: Whether it's a MOVIE or SHOW
- **release Year**: The year it was released
- **genres**: A list of genres the movie or show belongs to
- **imdb sc0re**: The IMDb score of the movie or show
- **imdb v0tes**: The number of votes on IMDb

</div>

<div style="background-color: #6B46C1; color: white; padding: 20px; border-radius: 10px; margin: 10px 0;">

## Instructions:

* Read the `movies_and_shows.csv` file into a DataFrame. It is in the `/datasets` directory, so the full path to pass to `read_csv` is `"/datasets/movies_and_shows.csv"`.
* Display the first few rows of the DataFrame to get an initial look at the data.

## Hints:

* Use pd.read_csv() to read the CSV file and store it in a variable called "df".
* Use the .head() method to display the first few rows.


In [19]:
##Importing Libraries
import pandas as pd
## Reading the dataset
df = pd.read_csv(r"C:\Users\robab\OneDrive\2025\Tripleten\Review TripleTen project\Sprint 1\movies_and_shows -sprint 1.csv")

In [20]:
# Preview the first five rows
print(df.head())

print()

df.info()

   Unnamed: 0             name                Character   r0le        TITLE  \
0           0   Robert De Niro            Travis Bickle  ACTOR  Taxi Driver   
1           1     Jodie Foster            Iris Steensma  ACTOR  Taxi Driver   
2           2    Albert Brooks                      Tom  ACTOR  Taxi Driver   
3           3    Harvey Keitel  Matthew 'Sport' Higgins  ACTOR  Taxi Driver   
4           4  Cybill Shepherd                    Betsy  ACTOR  Taxi Driver   

    Type  release Year              genres  imdb sc0re  imdb v0tes  
0  MOVIE          1976  ['drama', 'crime']         8.2    808582.0  
1  MOVIE          1976  ['drama', 'crime']         8.2    808582.0  
2  MOVIE          1976  ['drama', 'crime']         8.2    808582.0  
3  MOVIE          1976  ['drama', 'crime']         8.2    808582.0  
4  MOVIE          1976  ['drama', 'crime']         8.2    808582.0  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85579 entries, 0 to 85578
Data columns (total 10 columns):
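
The `Unnamed: 0` column in the preview above mirrors the row index, which usually means the CSV was saved with its index included. The following is a minimal sketch, assuming that is the case here, of how `index_col=0` keeps that column out of the data. The inline CSV text is a hypothetical stand-in built from two rows of the preview, not the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: the first header field is empty,
# which is what a CSV looks like when it was saved together with its index.
csv_text = (
    ",name,TITLE\n"
    "0,Robert De Niro,Taxi Driver\n"
    "1,Jodie Foster,Taxi Driver\n"
)

# Default read: the unnamed first column becomes a data column.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(with_extra.columns.tolist())

# index_col=0 uses the first column as the row index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(without_extra.columns.tolist())
```

Whether to drop the column or read it as the index is a judgment call. The rest of this notebook keeps `Unnamed: 0` as an ordinary column.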
 

<div style="background-color: #6B46C1; color: white; padding: 20px; border-radius: 10px; margin: 10px 0;">

## Task 1: Data Cleaning
Let's clean the data to fix issues with the column names.

### Instructions:

Rename the columns to correct any errors and make them consistent.

### Hints:

* Use the `.rename()` method to rename columns.
* Pass a dictionary to the `columns` parameter of `.rename()`, where the keys are the old column names and the values are the new names.
* Remove unnecessary whitespace from the column names.


In [21]:
print(df.columns)


Index(['Unnamed: 0', '   name', 'Character', 'r0le', 'TITLE', '  Type',
       'release Year', 'genres', 'imdb sc0re', 'imdb v0tes'],
      dtype='object')


In [22]:
## Renaming the columns to fix issues
## (keys must match the existing names exactly, including leading spaces)

df.rename(columns={'   name': 'name',
                   'Character': 'character',
                   'r0le': 'role',
                   'TITLE': 'title',
                   '  Type': 'type',
                   'release Year': 'release_year',
                   'imdb sc0re': 'imdb_score',
                   'imdb v0tes': 'imdb_votes'},
          inplace=True)
print(df.columns)

Index(['Unnamed: 0', 'name', 'character', 'role', 'title', 'type',
       'release_year', 'genres', 'imdb_score', 'imdb_votes'],
      dtype='object')


<div style="background-color: #2563EB; color: white; padding: 20px; border-radius: 10px; margin: 10px 0;">
Alternatively, we can clean the column names with the following code.

## 📝 Code Explanation: Column Cleaning Chain

The following code demonstrates a powerful pandas method chaining approach for cleaning column names:

```python
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('0', 'o')
```

### Breaking Down Each Section:

| **Method** | **Purpose** | **Example** |
|------------|-------------|-------------|
| `.str.strip()` | **Removes whitespace** from the beginning and end of column names | `"   name"` → `"name"` |
| `.str.lower()` | **Converts to lowercase** for consistency | `"TITLE"` → `"title"` |
| `.str.replace(' ', '_')` | **Replaces spaces with underscores** to make columns Python-friendly | `"release Year"` → `"release_year"` |
| `.str.replace('0', 'o')` | **Fixes typos** where zeros were used instead of the letter 'o'. It replaces every `0`, so `"Unnamed: 0"` is changed as well | `"r0le"` → `"role"`, `"sc0re"` → `"score"` |

### Why This Approach is Useful:
- ✅ **Method chaining**: Applies multiple transformations in one line
- ✅ **Consistent naming**: All columns follow the same naming convention
- ✅ **Python-friendly**: No spaces or special characters that could cause issues
- ✅ **Readable code**: Easy to understand the sequence of operations

</div>
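
One caveat with the chained approach is that `.str.replace('0', 'o')` rewrites every zero, including the one in `Unnamed: 0`. The sketch below applies the chain to the column names printed earlier, using an empty stand-in DataFrame (`demo` is a hypothetical name), and then shows one possible workaround: dropping that column before renaming. `regex=False` is passed explicitly so the replacements are treated as literal text.

```python
import pandas as pd

# Column names exactly as printed by df.columns above; no rows are needed.
raw_columns = ['Unnamed: 0', '   name', 'Character', 'r0le', 'TITLE', '  Type',
               'release Year', 'genres', 'imdb sc0re', 'imdb v0tes']
demo = pd.DataFrame(columns=raw_columns)

# The chain as written: note what happens to 'Unnamed: 0'.
chained = (demo.columns.str.strip().str.lower()
           .str.replace(' ', '_', regex=False)
           .str.replace('0', 'o', regex=False))
print(list(chained))

# One possible workaround: drop the leftover index column first.
demo = demo.drop(columns=['Unnamed: 0'])
demo.columns = (demo.columns.str.strip().str.lower()
                .str.replace(' ', '_', regex=False)
                .str.replace('0', 'o', regex=False))
print(list(demo.columns))
```

If the column should be kept, renaming it explicitly with `.rename()` before or after the chain is another option.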

## Task 2: Correcting a Misspelled Name in the Data

While analyzing the dataset, you notice that some names are misspelled or contain special characters due to encoding issues. Accurate data is essential for reporting and recommendations, so let’s correct one of these entries.

### Instructions

1. **Locate the Row with the Incorrect Name**:
   - Use `.loc[]` to retrieve the row where `name` is `"In??s Prieto"`.
   - You can locate the row by its index label (85576) together with the `name` column.
   - Print the row to verify that you have the correct one.


2. **Correct the Name**:
   - Using `.loc[]`, update the `name` column for this row to `"Ines Prieto"`.
   
3. **Verify the Correction**:
   - Print the row again to ensure that the name has been corrected.

<div style="background: linear-gradient(135deg, #1e40af 0%, #059669 100%); color: white; padding: 25px; border-radius: 15px; margin: 15px 0; box-shadow: 0 8px 32px rgba(0,0,0,0.3);">

## 📍 Understanding `.loc` - Label-Based Data Selection

The `.loc` accessor is pandas' primary tool for **label-based indexing**. It allows you to select data using actual index labels and column names.

### 🔧 Basic Syntax:
<div style="background-color: #1f2937; padding: 15px; border-radius: 8px; border-left: 4px solid #fbbf24; margin: 10px 0; color: white;">

```python
df.loc[row_indexer, column_indexer]
```
</div>

### 📋 Common Applications:

<div style="background-color: rgba(255,255,255,0.1); padding: 15px; border-radius: 10px; margin: 15px 0; color: white;">

| **Usage** | **Syntax** | **Description** |
|-----------|------------|-----------------|
| 🎯 **Single Value** | <code style="background: #dc2626; padding: 3px 6px; border-radius: 4px; color: white;">df.loc[row, col]</code> | Get value at specific row/column |
| 📄 **Single Row** | <code style="background: #7c3aed; padding: 3px 6px; border-radius: 4px; color: white;">df.loc[row]</code> | Get entire row |
| 📚 **Multiple Rows** | <code style="background: #059669; padding: 3px 6px; border-radius: 4px; color: white;">df.loc[start:end]</code> | Get slice of rows |
| 🎨 **Specific Columns** | <code style="background: #ea580c; padding: 3px 6px; border-radius: 4px; color: white;">df.loc[:, ['col1', 'col2']]</code> | Get specific columns for all rows |
| 🔍 **Conditional Selection** | <code style="background: #0891b2; padding: 3px 6px; border-radius: 4px; color: white;">df.loc[df['col'] > 5]</code> | Filter rows based on condition |
| ✏️ **Update Values** | <code style="background: #be123c; padding: 3px 6px; border-radius: 4px; color: white;">df.loc[row, col] = new_value</code> | Modify specific cell |

</div>

### ⚡ Examples in Context:

<div style="background-color: #0f172a; padding: 20px; border-radius: 10px; border: 2px solid #fbbf24; margin: 15px 0; color: white;">

**🔎 Get single value (what we're doing below):**
<div style="background: linear-gradient(90deg, #7c3aed, #a855f7); padding: 8px; border-radius: 6px; margin: 8px 0; color: white;">

```python
df.loc[85576, 'name']
```
</div>

**📋 Get entire row:**
<div style="background: linear-gradient(90deg, #7c3aed, #a855f7); padding: 8px; border-radius: 6px; margin: 8px 0; color: white;">

```python
df.loc[85576]
```
</div>

**✏️ Update a value:**
<div style="background: linear-gradient(90deg, #be123c, #e11d48); padding: 8px; border-radius: 6px; margin: 8px 0; color: white;">

```python
df.loc[85576, 'name'] = "Ines Prieto"
```
</div>

**🔍 Filter and select columns:**
<div style="background: linear-gradient(90deg, #7c3aed, #a855f7); padding: 8px; border-radius: 6px; margin: 8px 0; color: white;">

```python
df.loc[df['imdb_score'] > 8.0, ['title', 'imdb_score']]
```
</div>

</div>

### ⚠️ Important Notes:
<div style="background-color: rgba(239, 68, 68, 0.2); border: 2px solid #ef4444; padding: 15px; border-radius: 10px; margin: 10px 0; color: white;">

- 🏷️ Uses **index labels**, not positions
- 📏 Includes both endpoints in slices  
- 🛠️ Perfect for data cleaning and targeted updates

</div>

</div>
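
The note about slices is easy to miss, so here is a small sketch contrasting `.loc` (label-based, end label included) with `.iloc` (position-based, end position excluded). The tiny DataFrame and its index labels are made up for illustration.

```python
import pandas as pd

# Hypothetical four-row frame with non-default index labels.
demo = pd.DataFrame(
    {'title': ['A', 'B', 'C', 'D'], 'imdb_score': [8.2, 7.5, 9.1, 6.0]},
    index=[10, 20, 30, 40],
)

# Label slice: rows labelled 20 and 30, both endpoints included.
loc_slice = demo.loc[20:30]
print(loc_slice)

# Position slice: only position 1, because the end position is excluded.
iloc_slice = demo.iloc[1:2]
print(iloc_slice)
```

With the default `RangeIndex` used in this notebook, labels and positions coincide, which is why `df.loc[85576]` also happens to be the row at position 85576.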

In [23]:
# Locate the incorrect name by its index label
df.loc[85576, 'name']

'In??s Prieto'

In [24]:
# Correct the misspelled name
df.loc[85576, 'name'] = "Ines Prieto"

# Verify the correction
print("After correction:")
print(df.loc[85576, 'name'])

After correction:
Ines Prieto
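
The correction above relies on already knowing the index label 85576. When the label is not known, a boolean mask can locate and update the row by its content instead. This is a minimal sketch on a two-row stand-in (`demo` is hypothetical, with values taken from rows shown in this notebook), not the full dataset.

```python
import pandas as pd

# Stand-in for df: one clean row and the row with the garbled name.
demo = pd.DataFrame(
    {'name': ['Robert De Niro', 'In??s Prieto'],
     'title': ['Taxi Driver', 'Lokillo']},
    index=[0, 85576],
)

# Build a mask that is True only where the garbled name appears.
mask = demo['name'] == 'In??s Prieto'
print(demo.loc[mask])

# Update the name in every matching row, then verify.
demo.loc[mask, 'name'] = 'Ines Prieto'
print(demo.loc[mask, 'name'])
```

A mask updates every matching row, so it is worth printing the matches first to confirm that only the intended rows are selected.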


## Task 3: Finding All Movies and Shows Featuring Ines Prieto

Now that we've corrected the spelling of "Ines Prieto" in the dataset, let's find all the TV shows and movies she has acted in. This type of filtering is helpful for generating actor-specific profiles or building a list of their works.

### Instructions

1. **Filter by Actor’s Name**:
   - Use a filtering condition to select rows where the `name` column is equal to `"Ines Prieto"`.
   
2. **Display Relevant Columns**:
   - From each matching row, retrieve only the `title`, `release_year`, `imdb_score`, and `genres` columns for a clear, concise output.

**Hint:**

To filter rows based on a specific value in a column, use a condition inside df[ ... ]. In this case, check if the name column equals "Ines Prieto". Then, select only the columns you need (like title, release_year, imdb_score, and genres) by specifying them in double brackets [ [ ... ] ].

In [25]:
filter_name= df[df['name']=="Ines Prieto"]
print(filter_name[['title', 'release_year', 'imdb_score', 'genres']])

         title  release_year  imdb_score      genres
85576  Lokillo          2021         3.8  ['comedy']


In [26]:
filter_name= df[df['name']=="Ines Prieto"][['title', 'release_year', 'imdb_score', 'genres']]
print(filter_name)

         title  release_year  imdb_score      genres
85576  Lokillo          2021         3.8  ['comedy']


In [27]:
filter_name= df[df['name']=="Ines Prieto"][['title', 'release_year', 'imdb_score', 'genres']]
print(df.loc[filter_name.index])


       Unnamed: 0         name character   role    title       type  \
85576       85576  Ines Prieto     Fanny  ACTOR  Lokillo  the movie   

       release_year      genres  imdb_score  imdb_votes  
85576          2021  ['comedy']         3.8        68.0  


<div style="background: linear-gradient(135deg, #dc2626 0%, #ea580c 100%); color: white; padding: 25px; border-radius: 15px; margin: 15px 0; box-shadow: 0 8px 32px rgba(0,0,0,0.3);">

## 🔍 Comparing the Three Filtering Approaches

This section compares the three code cells above (cells 25, 26, and 27):

### 📊 **Method 1 - Two-Step Approach** (Cell 25):
<div style="background-color: #1f2937; padding: 15px; border-radius: 8px; border-left: 4px solid #10b981; margin: 10px 0; color: white;">

```python
filter_name = df[df['name']=="Ines Prieto"]
print(filter_name[['title', 'release_year', 'imdb_score', 'genres']])
```
</div>

**What it does:**
- 🎯 **Step 1**: Creates `filter_name` containing ALL columns for Ines Prieto rows
- 🎯 **Step 2**: Selects only specific columns for display
- ✅ **Keeps full rows**: `filter_name` still holds every column of the matching rows
- ✅ **Flexible**: Can reuse `filter_name` for other operations

### 📊 **Method 2 - One-Step Column Selection** (Cell 26):
<div style="background-color: #1f2937; padding: 15px; border-radius: 8px; border-left: 4px solid #f59e0b; margin: 10px 0; color: white;">

```python
filter_name = df[df['name']=="Ines Prieto"][['title', 'release_year', 'imdb_score', 'genres']]
print(filter_name)
```
</div>

**What it does:**
- 🎯 **Single Step**: Filter rows AND select columns in one line
- ✅ **Concise**: More compact code
- ✅ **Direct**: `filter_name` only contains the 4 specified columns
- ⚠️ **Limited**: Can't access other columns later without re-filtering

### 📊 **Method 3 - Index-Based Retrieval** (Cell 27):
<div style="background-color: #1f2937; padding: 15px; border-radius: 8px; border-left: 4px solid #ef4444; margin: 10px 0; color: white;">

```python
filter_name = df[df['name']=="Ines Prieto"][['title', 'release_year', 'imdb_score', 'genres']]
print(df.loc[filter_name.index])
```
</div>

**What it does:**
- 🎯 **Step 1**: Creates filtered DataFrame with 4 columns
- 🎯 **Step 2**: Uses the INDEX from filtered data to get ALL columns from original DataFrame
- ⚠️ **Circular**: Defeats the purpose of column selection
- ❌ **Inefficient**: Extra work to get back to full row data

### 🏆 **Recommendation:**

<div style="background-color: rgba(16, 185, 129, 0.2); border: 2px solid #10b981; padding: 15px; border-radius: 10px; margin: 10px 0; color: white;">

**Use Method 2** for this task - it's the most direct and efficient approach:

```python
filtered_data = df[df['name']=="Ines Prieto"][['title', 'release_year', 'imdb_score', 'genres']]
```

</div>

</div>
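
Method 2 chains two bracket operations. The `.loc` form shown earlier, `df.loc[condition, columns]`, does the row filter and column selection in a single indexing call and returns the same result. The sketch below checks this on a two-row stand-in (`demo` is hypothetical, with values copied from rows shown in this notebook).

```python
import pandas as pd

# Small stand-in for df using values that appear in the outputs above.
demo = pd.DataFrame({
    'name': ['Ines Prieto', 'Robert De Niro'],
    'title': ['Lokillo', 'Taxi Driver'],
    'release_year': [2021, 1976],
    'genres': ["['comedy']", "['drama', 'crime']"],
    'imdb_score': [3.8, 8.2],
})
cols = ['title', 'release_year', 'imdb_score', 'genres']

# Method 2: filter rows, then select columns (two indexing steps).
chained = demo[demo['name'] == 'Ines Prieto'][cols]

# Same selection in one .loc call.
single = demo.loc[demo['name'] == 'Ines Prieto', cols]

print(single)
print(chained.equals(single))
```

For read-only selection either form works. For assignment, the single `.loc` call is the safer habit, because chained indexing may operate on a temporary copy.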

## Task 4: Finding Highly Rated Movies

We want to identify movies with an IMDb rating above **9.0**. This list could be helpful for curating a "Top Movies" section based on high ratings.

### Instructions

1. **Filter for High IMDb Scores**:
   - First, filter the DataFrame to include only rows where the `imdb_score` is greater than 9.0.

2. **Extract the Titles**:
   - From this filtered DataFrame, select only the `title` column, which contains the names of the movies.

3. **Get Unique Titles**:
   - Convert the resulting list of titles to a set to remove any duplicate titles. Using `set()` will keep only unique movie names.

   - *Example*: `unique_titles = set(high_score_titles)`

4. **Print the Unique Titles**:
   - Display the final set of unique movie titles to see the list of top-rated movies.

In [39]:
# Filter for movies with an IMDb score above 9.0
imdb_above_nine = df[df['imdb_score'] > 9.0]  # this is a new DataFrame

# Extract the 'title' column from the filtered DataFrame
imdb_titles = imdb_above_nine['title']

# Get the unique titles (a NumPy array, in order of first appearance)
unique_imdb_titles = imdb_titles.unique()

# Print the unique titles
print(unique_imdb_titles)


['Breaking Bad' 'Avatar: The Last Airbender' 'Reply 1988' 'My Mister'
 'The Last Dance' 'Our Planet' 'Kota Factory' 'Major']
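
The instructions suggest `set()`, while the cell above uses `.unique()`. Both remove duplicates. The difference is that `.unique()` returns a NumPy array in order of first appearance, whereas a `set` has no guaranteed order. The sketch below compares them on a short made-up Series that reuses three titles from the output.

```python
import pandas as pd

# Hypothetical Series with repeated titles, as happens when a title has
# one row per cast member.
titles = pd.Series(['Breaking Bad', 'Breaking Bad', 'Our Planet',
                    'Major', 'Our Planet'])

as_array = titles.unique()   # NumPy array, first-appearance order
as_set = set(titles)         # Python set, unordered

print(as_array)
print(as_set)
print(set(as_array) == as_set)
```

If the order of titles matters for display, `.unique()` (or `sorted(as_set)`) gives a reproducible result.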
