### Day_014: FEATURE ENGINEERING & DATA PREPROCESSING FOR ML
***Today's Goal:*** Create better features from existing data and prepare the data before training models.

### Load the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# read the data 
df = pd.read_csv(r"C:\Users\MALWADE TANYA\Downloads\titles.csv")
df.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,48,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,113,"['crime', 'drama']",['US'],,tt0075314,8.3,795222.0,27.612,8.2
2,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['comedy', 'fantasy']",['GB'],,tt0071853,8.2,530877.0,18.216,7.8
3,tm70993,Life of Brian,MOVIE,"Brian Cohen is an average young Jewish man, bu...",1979,R,94,['comedy'],['GB'],,tt0079470,8.0,392419.0,17.505,7.8
4,tm190788,The Exorcist,MOVIE,12-year-old Regan MacNeil begins to adapt an e...,1973,R,133,['horror'],['US'],,tt0070047,8.1,391942.0,95.337,7.7


### Create Content Age
- Older titles may be rated differently from recent ones, so the age of a title can be a useful feature.

In [3]:
from datetime import datetime

current_year = datetime.now().year
df['content_age'] = current_year - df['release_year']


### Runtime Category
- Instead of using only the raw runtime in minutes, group titles into length categories.

In [4]:
df['runtime_category'] = pd.cut(
    df['runtime'],
    bins=[0, 60, 120, 180, 500],
    labels=['Short','Medium','Long','Very Long'],
    include_lowest=True  # keep runtime == 0 in 'Short' instead of turning it into NaN
)


- Gives the model an explicit length class for each title alongside the raw runtime.
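
How `pd.cut` treats the bin edges can be checked on a handful of made-up runtimes. This is only a sketch: the values below are illustrative and not taken from `titles.csv`. By default each bin is open on the left, so a runtime of exactly 0 falls outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

# Toy runtimes in minutes (illustrative values, not from titles.csv)
runtimes = pd.Series([0, 45, 60, 61, 120, 150, 240])

bins = [0, 60, 120, 180, 500]
labels = ['Short', 'Medium', 'Long', 'Very Long']

# Default: intervals are (0, 60], (60, 120], ... so a runtime of 0 gets no category
default_cut = pd.cut(runtimes, bins=bins, labels=labels)

# include_lowest=True also places the lowest edge (0) in the first bin
inclusive_cut = pd.cut(runtimes, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    'runtime': runtimes,
    'default': default_cut,
    'inclusive': inclusive_cut,
}))
```

Runtimes above the last edge (500 here) would also become NaN, so the upper bound should be chosen with the data's maximum in mind.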

### Popularity Level

In [5]:
df['popularity_level'] = pd.qcut(
    df['tmdb_popularity'],
    q=4,
    labels=['Low','Medium','High','Very High']
)


- Converts continuous popularity into four quartile-based groups of roughly equal size.
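
Unlike `pd.cut`, which uses fixed edges, `pd.qcut` picks the edges from the data's quantiles. A minimal sketch on made-up, skewed popularity scores (the numbers are illustrative only) shows that each group ends up with the same number of rows:

```python
import pandas as pd

# Skewed toy popularity scores (illustrative values only)
popularity = pd.Series([0.6, 1.2, 2.5, 3.1, 5.0, 8.4, 17.5, 95.3])

levels = pd.qcut(popularity, q=4, labels=['Low', 'Medium', 'High', 'Very High'])

# Each quartile holds the same number of rows, regardless of the skew
print(levels.value_counts(sort=False))
```

If many rows share the same value, two quantile edges can coincide and `pd.qcut` raises a `ValueError`; its `duplicates='drop'` option merges such bins, but the number of labels then has to match the reduced number of bins.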

### Vote Intensity

In [6]:
df['vote_category'] = pd.qcut(
    df['imdb_votes'],
    q=4,
    labels=['Low Votes','Moderate Votes','High Votes','Very High Votes']
)

### Number of Genres Per Title

In [7]:
import ast  # genres is text like "['crime', 'drama']"; parsing it makes "[]" count as 0
df['genre_count'] = df['genres'].apply(lambda x: len(ast.literal_eval(x)))

- A higher genre count suggests the title spans more diverse content.
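
Because `genres` is stored as a string that looks like a Python list, counting by commas and counting by parsing can disagree on empty lists. A small check on made-up rows, assuming the column may contain `"[]"` for titles without genres:

```python
import ast
import pandas as pd

# The genres column is stored as text that looks like a Python list
genres = pd.Series(["['crime', 'drama']", "['comedy']", "[]"])

# Splitting on commas counts the empty list "[]" as one genre
by_split = genres.apply(lambda x: len(x.split(',')))

# Parsing the text into a real list gives 0 for "[]"
by_parse = genres.apply(lambda x: len(ast.literal_eval(x)))

print(by_split.tolist())  # [2, 1, 1]
print(by_parse.tolist())  # [2, 1, 0]
```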

### DATA PREPROCESSING FOR ML

### Select Final Features

In [23]:
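# Target variable
y = df['imdb_score']

# Drop the target plus identifier and free-text columns that are not used as features
X = df.drop([
    'imdb_score',
    'id',
    'title',
    'description',
    'imdb_id'
], axis=1)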


### Check which columns are categorical

In [24]:
X.select_dtypes(include='object').columns


Index(['type', 'age_certification', 'genres', 'production_countries'], dtype='object')

### Convert ALL categorical columns at once

In [25]:
X = pd.get_dummies(X, drop_first=True)
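
One thing worth checking with this approach: `pd.get_dummies` treats every distinct string as its own category, so a list-like value such as `"['crime', 'drama']"` becomes a single column instead of one column per genre. The sketch below uses a made-up three-row frame to show this, followed by one possible alternative (not part of the original notebook) that parses the lists and builds one indicator per genre.

```python
import ast
import pandas as pd

toy = pd.DataFrame({
    'type': ['SHOW', 'MOVIE', 'MOVIE'],
    'genres': ["['documentation']", "['crime', 'drama']", "['comedy']"],
})

# Each distinct genres string becomes its own column;
# drop_first=True removes one level per original column
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())

# Alternative sketch: parse the lists and build one indicator column per genre
genre_lists = toy['genres'].apply(ast.literal_eval)
multi_hot = (
    pd.get_dummies(genre_lists.explode())  # one row per (title, genre) pair
    .groupby(level=0).max()                # collapse back to one row per title
    .astype(int)
)
print(multi_hot)
```

On the full dataset the first approach can produce a very wide matrix, since `genres` and `production_countries` have many distinct combinations.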


### Convert Boolean Columns to Integers

In [26]:
bool_cols = X.select_dtypes(include='bool').columns
X[bool_cols] = X[bool_cols].astype(int)
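
As a possible shortcut, `pd.get_dummies` accepts a `dtype` argument, so the indicator columns can be created as integers in the first place. A minimal sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({'type': ['SHOW', 'MOVIE', 'MOVIE'], 'runtime': [48, 113, 91]})

# Asking for integers up front avoids a separate bool-to-int conversion
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.dtypes)
```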


In [27]:
print(X.select_dtypes(include='object').columns)


Index([], dtype='object')


### Train Test split

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
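
The preview of `df.head()` shows at least one title without an `imdb_score`. If such rows are still present at this point, the target contains NaN values, which most estimators reject. A hedged sketch of dropping them before the split, using a made-up frame (`toy`, `X_toy`, `y_toy` are illustrative names and the scores are example values):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    'runtime': [48, 113, 91, 94, 133, 100],
    'imdb_score': [np.nan, 8.3, 8.2, 8.0, 8.1, 7.5],
})

# Keep only rows where the target is known
known = toy.dropna(subset=['imdb_score'])
X_toy = known.drop(columns='imdb_score')
y_toy = known['imdb_score']

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))  # 4 1
```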

### Feature Scaling

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean and std from the training set only
X_test = scaler.transform(X_test)        # reuse the training statistics on the test set
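
The effect of fitting on the training data only can be seen on a tiny made-up example: the training column ends up with mean 0 and standard deviation 1, while the test value is expressed relative to the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values only)
train = np.array([[60.0], [90.0], [120.0]])
test = np.array([[150.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean=90, std≈24.49 from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Calling `fit_transform` on the test set instead would recompute the statistics from test data, which leaks information and makes the two sets inconsistent.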


#### Learnings
- Selected features and separated the independent variables (X) from the target (y)
- Applied one-hot encoding to convert categorical features into numerical format
- Ensured all features were numeric by encoding object columns and converting boolean columns to integers
- Split the dataset into training and testing sets
- Standardized features with `StandardScaler`, fitting on the training set and applying the same transform to the test set