## Movie Data Analysis & Automation Scripting Assessment
### Author: Richard Nguyen
### Date: July 2025

This report summarizes the data processing, analysis, and machine learning tasks performed on a movie dataset, following the instructions provided in the technical assessment prompt.

---

## 1. Data Loading & Preprocessing

The dataset was provided in CSV format and imported using Python's `pandas` library. I also manually inspected the file in Excel to verify its structure, column types, and the presence of null or inconsistent values.

### Data Loading
```python
import pandas as pd
import re

# Load data into a pandas df
df = pd.read_csv('../data/movies.csv')

# Verify data is loaded correctly
## Check rows + cols exist
num_rows, num_cols = df.shape
print(f"Row Count: {num_rows}")
print(f"Column Count: {num_cols}")

## Data Specs
df.info()  # info() prints its summary directly and returns None
```

- Loaded `movies.csv` using `pd.read_csv()`
- Verified number of rows and columns using `.shape`
- Inspected data structure and types using `.info()`

### Data Preprocessing
Preprocessing focused on cleaning the key columns and converting them to usable types for analysis and modeling:

#### Key Cleaning Steps
- **Cleaned the `Year` column** using regex to extract 4-digit years.
- **Removed rows with a missing or invalid `Year`** so that every remaining record has a usable release year.
- **Standardized `Duration`** by extracting the integer portion from strings like "90 min" using regex. A flag column was added to indicate missing durations.
- **Checked for and reported missing values** across all columns with a helper function that printed both counts and percentages.
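
The missing-value helper mentioned above is not shown in this report. A minimal sketch of what such a helper might look like follows; the function name `report_missing`, the output column names, and the toy DataFrame are assumptions for illustration, not the original code or data.

```python
import pandas as pd

def report_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Print and return per-column missing-value counts and percentages."""
    summary = pd.DataFrame({
        "missing_count": df.isna().sum(),
        # isna().mean() is the fraction of missing rows in each column
        "missing_pct": (df.isna().mean() * 100).round(2),
    })
    print(summary)
    return summary

# Toy frame for illustration only (not the assessment data)
sample = pd.DataFrame({
    "Year": ["2019", None, "(2021)"],
    "Duration": ["90 min", None, None],
})
summary = report_missing(sample)
```

Returning the summary as a DataFrame, rather than only printing it, makes it easy to re-run the same check after each cleaning step.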

#### Example Function (used for cleaning `Year`):
```python
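def clean_year(value):
    """Extract the first 4-digit run from a raw Year value; return None if absent."""
    # Missing values (NaN/None) stay missing
    if pd.isna(value):
        return None
    # Cast to str so numeric and string inputs are handled the same way
    match = re.search(r'\d{4}', str(value))
    if match:
        return int(match.group())
    return None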
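
The `Duration` standardization described in the cleaning steps could follow the same pattern. The sketch below is an approximation under stated assumptions: the helper name `clean_duration`, the flag column name `Duration_Missing`, and the sample values are hypothetical.

```python
import re
import pandas as pd

def clean_duration(value):
    """Return the first integer found in a value like '90 min', or None."""
    if pd.isna(value):
        return None
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else None

# Toy values for illustration only
sample = pd.DataFrame({"Duration": ["90 min", None, "142 min", "unknown"]})
sample["Duration"] = sample["Duration"].apply(clean_duration)

# Flag rows where no duration could be recovered (column name is an assumption)
sample["Duration_Missing"] = sample["Duration"].isna()
print(sample)
```

Keeping a flag column instead of dropping these rows preserves the records for analyses that do not depend on runtime.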

Overall, this step established a **cleaned and structured dataset**, which was saved as `movies_cleaned.csv` and used for all subsequent steps.

---

## 2. Data Analysis & Visualization

Following preprocessing, I conducted both basic and advanced analyses to extract meaningful insights from the movie dataset. This included descriptive statistics, trend identification, correlation examination, and visual storytelling.

### Descriptive Statistics

I began by reviewing the basic metrics of the dataset using the `describe()` method.

![Descriptive Analysis](./imgs/decriptive_analysis.png)

I then examined frequency counts and top values for several categorical features:

- **Top 10 Genres**: Highlighted the most prevalent movie genres.
- **Top 10 Directors**: Identified the most represented directors.
- **Top 10 Actors**: Aggregated appearances across the three actor columns.
- **Top 10 Voted Movies**: Sorted by number of user votes.

Each of these was visualized using horizontal bar plots, arranged in a multi-plot grid for readability.

![Top 10s Analysis](./imgs/top_10_analysis.png)
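
Aggregating actor appearances across three columns is the least obvious of these counts. One way it might be done is sketched below; the column names `Actor 1`, `Actor 2`, and `Actor 3` and the toy rows are assumptions, since the report does not show the actual schema.

```python
import pandas as pd

# Toy data; the column names are assumed, not confirmed by the report
movies = pd.DataFrame({
    "Actor 1": ["A", "B", "A"],
    "Actor 2": ["B", "C", None],
    "Actor 3": ["C", "A", "B"],
})
actor_cols = ["Actor 1", "Actor 2", "Actor 3"]

# Melt the three columns into one long column, then count appearances.
# value_counts() ignores missing entries by default.
top_actors = (
    movies[actor_cols]
    .melt(value_name="actor")["actor"]
    .value_counts()
    .head(10)
)
print(top_actors)
```

The single-column counts (genres, directors) reduce to `value_counts().head(10)` on the relevant column.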

### Distribution Plots

The key numerical attributes are distributed as follows:

- **Ratings**: Slightly right-skewed, with most movies rated between 6.0 and 7.5.
- **Duration**: Concentrated around the 110–150 minute mark.
- **Year**: Peaks around the 2000s–2010s, suggesting more recent movies dominate the dataset.

I visualized these using histograms:

![Distribution Analysis](./imgs/distribution_analysis.png)
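
Skew read off a histogram can be checked numerically. The sketch below uses made-up rating values (not the assessment data) to show the idea: a positive `skew()` indicates a longer right tail, a negative one a longer left tail.

```python
import pandas as pd

# Toy rating values for illustration only
ratings = pd.Series([5.8, 6.0, 6.2, 6.5, 6.8, 7.0, 7.4, 9.5])

# Sample skewness: > 0 means right-skewed, < 0 means left-skewed
skewness = ratings.skew()

# Coarse bin counts, as a text alternative to a histogram
binned = pd.cut(ratings, bins=[0, 6, 7.5, 10]).value_counts().sort_index()

print(f"Skewness: {skewness:.2f}")
print(binned)
```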

### Correlation Analysis

I computed the Pearson correlation coefficients for numerical columns:

```
            Rating     Votes  Duration      Year
Rating    1.000000  0.126635 -0.031093 -0.166673
Votes     0.126635  1.000000  0.099660  0.129016
Duration -0.031093  0.099660  1.000000 -0.374097
Year     -0.166673  0.129016 -0.374097  1.000000
```

**Key Insights:**
- `Votes` has a weak positive correlation with `Rating` (r ≈ 0.13), indicating a slight relationship between popularity and perceived quality.
- `Year` and `Duration` show the strongest correlation in the matrix (r ≈ -0.37), a moderate negative relationship suggesting that newer movies trend shorter.
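
A matrix of this shape can be produced with `DataFrame.corr`. The sketch below runs on a small made-up DataFrame, so its numbers will not match the table above; only the column names are taken from the report.

```python
import pandas as pd

# Toy values for illustration only
movies = pd.DataFrame({
    "Rating": [7.1, 6.4, 8.0, 5.5, 6.9],
    "Votes": [1200, 300, 5400, 150, 900],
    "Duration": [120, 95, 140, 100, 110],
    "Year": [2001, 2015, 1995, 2019, 2010],
})
numeric_cols = ["Rating", "Votes", "Duration", "Year"]

# Pearson is the default method; it is spelled out here for clarity
corr = movies[numeric_cols].corr(method="pearson")
print(corr)
```

Pearson's coefficient only captures linear association, so a weak value does not rule out a non-linear relationship.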

### Trend Analysis

I used line plots to track average trends over time:

- **Duration over Time**: A visible decline in average runtime from the early 2000s onward.
- **Votes over Time**: Steady growth in votes, reflecting expanding audiences or platforms.

![Over Time Analysis](./imgs/trends_OT.png)
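
The yearly averages behind such line plots can be computed with a `groupby`. This is a sketch on made-up rows; the column names follow the report, everything else is an assumption.

```python
import pandas as pd

# Toy rows for illustration only
movies = pd.DataFrame({
    "Year": [2000, 2000, 2010, 2010, 2020],
    "Duration": [150, 130, 120, 110, 100],
    "Votes": [100, 300, 500, 700, 900],
})

# One row per year, holding the mean of each metric
yearly = movies.groupby("Year")[["Duration", "Votes"]].mean()
print(yearly)
```

With matplotlib installed, `yearly.plot(subplots=True)` would draw one line plot per metric. Years with very few movies can make the averages noisy, so checking the per-year counts before reading a trend is worthwhile.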

---

## 3. Predictive Modeling

To explore potential applications of machine learning to this dataset, I attempted to build a simple regression model to predict movie ratings using available numeric features.

### Linear Regression

I began with a basic linear regression model using the numeric features `Votes`, `Year`, and `Duration`. While this model was easy to interpret, its performance was poor:

- **Mean Squared Error (MSE)**: 1.78
- **R² Score**: 0.08

This low R² score indicates that the linear model explains only about 8% of the variance in ratings.

After log-transforming the `Votes` column, results improved slightly:

- **Mean Squared Error (MSE)**: 1.72
- **R² Score**: 0.11
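
The report does not include the modeling code, so the following is only a sketch of how such a baseline might be set up with scikit-learn (an assumed library choice). The data is synthetic, the `Log_Votes` column name and the 80/20 split are assumptions, and the printed metrics will not match the figures above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own
rng = np.random.default_rng(0)
n = 500
movies = pd.DataFrame({
    "Votes": rng.integers(10, 100_000, size=n),
    "Year": rng.integers(1950, 2024, size=n),
    "Duration": rng.integers(80, 180, size=n),
})
movies["Rating"] = 3 + 0.3 * np.log1p(movies["Votes"]) + rng.normal(0, 1, size=n)

# log1p compresses the heavy right tail typical of vote counts
movies["Log_Votes"] = np.log1p(movies["Votes"])

X = movies[["Log_Votes", "Year", "Duration"]]
y = movies["Rating"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MSE: {mse:.2f}, R^2: {r2:.2f}")
```

Evaluating on a held-out split, rather than on the training rows, keeps the reported error from being overly optimistic.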


### Random Forest Regressor

To improve performance, I then applied a **Random Forest Regressor**, which is more flexible and can handle non-linear relationships without assuming a fixed functional form.

- After log-transforming the `Votes` column, performance improved modestly.
- The **R² score increased to approximately 0.23** (up from 0.11 for the linear model), indicating somewhat better predictive results.

This result suggests that numeric variables like `Votes`, `Duration`, and `Year` provide some signal, but much of the remaining explainable variance may lie in categorical features such as `Genre`, `Actors`, and `Director`.
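
A Random Forest fit along these lines might look like the sketch below. It assumes scikit-learn, uses synthetic data, and picks hyperparameters (`n_estimators=200`, a fixed `random_state`) arbitrarily for illustration; none of this is taken from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a mild non-linear effect of Duration
rng = np.random.default_rng(1)
n = 500
movies = pd.DataFrame({
    "Votes": rng.integers(10, 100_000, size=n),
    "Year": rng.integers(1950, 2024, size=n),
    "Duration": rng.integers(80, 180, size=n),
})
movies["Rating"] = (
    3
    + 0.3 * np.log1p(movies["Votes"])
    + 0.5 * (movies["Duration"] > 140)
    + rng.normal(0, 1, size=n)
)
movies["Log_Votes"] = np.log1p(movies["Votes"])

features = ["Log_Votes", "Year", "Duration"]
X_train, X_test, y_train, y_test = train_test_split(
    movies[features], movies["Rating"], test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)
forest_r2 = r2_score(y_test, forest.predict(X_test))

# Impurity-based importances give a rough view of which features carry signal
importances = pd.Series(forest.feature_importances_, index=features)
print(f"R^2: {forest_r2:.2f}")
print(importances.sort_values(ascending=False))
```

Tree splits depend only on the ordering of feature values, so a log transform of `Votes` is expected to matter much less for this model than for the linear one.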

### Next Steps (If More Time Were Available)

Given more time, I would have:

- **Implemented One-Hot Encoding** for the `Genre` column (and possibly actors or directors) to convert them into numerical features for further modeling.
- **Performed feature selection** or dimensionality reduction on these encoded variables.
- **Tried gradient boosting models** like XGBoost for better performance on sparse and mixed data types.
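
As an illustration of the first step, one-hot encoding of `Genre` could be sketched as follows. It assumes that genres are stored as comma-separated strings such as `"Action, Drama"`, which the report does not confirm; the sample rows are made up.

```python
import pandas as pd

# Toy rows; the comma-separated Genre format is an assumption
movies = pd.DataFrame({
    "Rating": [7.1, 6.4, 8.0, 5.5],
    "Genre": ["Action, Drama", "Comedy", "Drama", None],
})

# One indicator column per genre; rows with no genre become all zeros
genre_dummies = movies["Genre"].fillna("").str.get_dummies(sep=", ")

# Replace the raw text column with the indicator columns
encoded = pd.concat([movies.drop(columns="Genre"), genre_dummies], axis=1)
print(encoded)
```

For high-cardinality columns such as actors or directors, this approach would create thousands of sparse columns, which is where the feature selection step listed above would come in.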