# Zoho Offline Assignment
## With the given dataset (Rotten_Tomatoes_Movies3.xls), build a model to predict 'audience_rating'. Demonstrate the working of the pipeline in a notebook, and validate the model's accuracy.

### Importing Required Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


### Loading the Dataset

In [2]:
# Load the dataset (reading legacy .xls files with pandas requires the xlrd package)
file_path = 'Rotten_Tomatoes_Movies3.xls'  # Replace with your own file path
data = pd.read_excel(file_path)


### Handling Missing Data

In [3]:
# Fill missing numerical values with the column mean.
# Note: the numerical columns include the target 'audience_rating', so any missing targets are filled too.
numerical_columns = data.select_dtypes(include=['float64', 'int64']).columns
data[numerical_columns] = data[numerical_columns].fillna(data[numerical_columns].mean())

# Fill missing categorical values with an empty string
categorical_columns = data.select_dtypes(include=['object']).columns
data[categorical_columns] = data[categorical_columns].fillna('')
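
Filling the target with its mean creates labels that were never observed, so one alternative is to drop rows that lack a target and fill only the features. The following is a minimal sketch on a made-up toy frame (the values and the `'Unknown'` placeholder are assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "runtime_in_minutes": [90.0, np.nan, 120.0, 100.0],
    "studio_name": ["A", None, "B", "A"],
    "audience_rating": [70.0, 55.0, np.nan, 80.0],
})

# Count missing values per column before deciding how to fill them.
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop rows whose target is missing instead of filling the target with the mean.
toy_clean = toy.dropna(subset=["audience_rating"]).copy()

# Fill the remaining gaps: mean for the numeric feature, a placeholder for text.
toy_clean["runtime_in_minutes"] = toy_clean["runtime_in_minutes"].fillna(
    toy_clean["runtime_in_minutes"].mean()
)
toy_clean["studio_name"] = toy_clean["studio_name"].fillna("Unknown")
print(toy_clean)
```

Whether dropping or filling is preferable depends on how many targets are missing; the sketch only illustrates the mechanics.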


### Processing Date Columns

In [4]:
# Convert datetime columns (in_theaters_date, on_streaming_date)
if 'in_theaters_date' in data.columns:
    data['in_theaters_date'] = pd.to_datetime(data['in_theaters_date'], errors='coerce')
    data['theater_year'] = data['in_theaters_date'].dt.year
    data['theater_month'] = data['in_theaters_date'].dt.month
    data.drop(columns=['in_theaters_date'], inplace=True)

if 'on_streaming_date' in data.columns:
    data['on_streaming_date'] = pd.to_datetime(data['on_streaming_date'], errors='coerce')
    data['streaming_year'] = data['on_streaming_date'].dt.year
    data['streaming_month'] = data['on_streaming_date'].dt.month
    data.drop(columns=['on_streaming_date'], inplace=True)
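
The two date blocks above follow the same steps, so they could be folded into one helper. This is a sketch only: `add_date_parts` is a hypothetical function name, and the toy dates are invented for illustration.

```python
import pandas as pd

def add_date_parts(frame, column, prefix):
    """Replace a date column with <prefix>_year and <prefix>_month columns."""
    if column not in frame.columns:
        return frame
    # Unparseable or missing dates become NaT, which yields NaN year/month.
    parsed = pd.to_datetime(frame[column], errors="coerce")
    out = frame.drop(columns=[column])
    out[f"{prefix}_year"] = parsed.dt.year
    out[f"{prefix}_month"] = parsed.dt.month
    return out

# Toy input with one valid date, one invalid string, and one missing value.
toy = pd.DataFrame({"in_theaters_date": ["2010-02-12", "not a date", None]})
toy = add_date_parts(toy, "in_theaters_date", "theater")
print(toy)
```

The NaN values produced for unparseable dates explain why the year and month columns end up as `float64`; they are filled later by the imputer.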


### Ensuring Uniformity in Categorical Columns

In [5]:
# Ensure categorical columns (like 'movie_title' and 'studio_name') are uniform (convert mixed types to strings)
data['movie_title'] = data['movie_title'].astype(str)
data['studio_name'] = data['studio_name'].astype(str)


### Encoding Categorical Variables

In [6]:
# Label-encode every column that is still of object (string) type.
# The column list is recomputed here because earlier cells may have dropped or added columns.
for column in data.select_dtypes(include=['object']).columns:
    try:
        # Use a fresh encoder per column so each column gets its own mapping
        data[column] = LabelEncoder().fit_transform(data[column])
    except Exception as e:
        print(f"Could not encode column {column}: {e}")
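
`LabelEncoder` is designed for target labels; for feature columns scikit-learn also offers `OrdinalEncoder`, which can be fit on the training split and then map unseen test categories to a sentinel. A minimal sketch on invented toy frames (the category values and the `-1` sentinel are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames; "NC17" and "Horror" never appear in the training data.
train = pd.DataFrame({"rating": ["PG", "R", "PG"],
                      "genre": ["Drama", "Comedy", "Drama"]})
test = pd.DataFrame({"rating": ["R", "NC17"],
                     "genre": ["Comedy", "Horror"]})

# Categories unseen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train)
test_encoded = encoder.transform(test)

print(train_encoded)
print(test_encoded)
```

Either encoder assigns arbitrary integer order to the categories, which a linear model will treat as numeric magnitude; that limitation applies to the notebook's approach as well.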


### Checking Data Types After Encoding

In [7]:
# After encoding, recheck the data types
print("Data types after encoding categorical columns:")
print(data.dtypes)


Data types after encoding categorical columns:
movie_title             int64
movie_info              int64
critics_consensus       int64
rating                  int64
genre                   int64
directors               int64
writers                 int64
cast                    int64
runtime_in_minutes    float64
studio_name             int64
tomatometer_status      int64
tomatometer_rating      int64
tomatometer_count       int64
audience_rating       float64
theater_year          float64
theater_month         float64
streaming_year        float64
streaming_month       float64
dtype: object


### Target and Feature Variables

In [8]:
# Target variable: the audience rating we want to predict
y = data['audience_rating']

# Feature variables: every column except the target
X = data.drop(columns=['audience_rating'])


### Splitting the Data into Training and Testing Sets

In [9]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes of the resulting datasets
print("Training data (X_train) shape:", X_train.shape)
print("Testing data (X_test) shape:", X_test.shape)
print("Training labels (y_train) shape:", y_train.shape)
print("Testing labels (y_test) shape:", y_test.shape)


Training data (X_train) shape: (13310, 17)
Testing data (X_test) shape: (3328, 17)
Training labels (y_train) shape: (13310,)
Testing labels (y_test) shape: (3328,)


### Handling Missing Values Using Imputer

In [10]:
# Create an imputer to fill missing values with the mean
imputer = SimpleImputer(strategy='mean')

# Fit the imputer on the training data and transform both the training and test sets
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)


### Scaling the Data

In [11]:
# Now scale the imputed data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)


### Training the Model

In [12]:
# Initialize and train the model
model = LinearRegression()
model.fit(X_train_scaled, y_train)
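
The imputing, scaling, and training steps can also be chained in a single scikit-learn `Pipeline`, so that every transform is fit on the training data only. The sketch below uses synthetic data as a stand-in for the notebook's features; the step names and the `X_demo`/`y_demo` arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (not the movie data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)
X_demo[::10, 2] = np.nan  # inject some missing values for the imputer to fill

# Same three steps as the notebook cells: impute, scale, fit a linear model.
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# Fit on the first 160 rows and score on the remaining 40.
pipeline.fit(X_demo[:160], y_demo[:160])
demo_r2 = pipeline.score(X_demo[160:], y_demo[160:])
print("Held-out R^2 on synthetic data:", demo_r2)
```

With the notebook's variables, the equivalent calls would be `pipeline.fit(X_train, y_train)` followed by `pipeline.predict(X_test)`.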


### Making Predictions

In [13]:
# Make predictions
y_pred = model.predict(X_test_scaled)


### Evaluating the Model

In [14]:
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print evaluation metrics
print("Mean Squared Error:", mse)
print("R-squared Score:", r2)


Mean Squared Error: 214.51140193767048
R-squared Score: 0.47414403322746057
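
An MSE of about 214.5 corresponds to an RMSE of roughly 14.6 rating points, which is easier to interpret because it is in the same units as the target. A single train/test split can also be complemented by k-fold cross-validation. The sketch below shows both ideas; the cross-validation part runs on synthetic data (an assumption for illustration), not on the movie dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# RMSE derived from the MSE reported above.
reported_rmse = np.sqrt(214.51140193767048)
print("RMSE from the reported MSE:", reported_rmse)

# Synthetic regression data standing in for X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# 5-fold cross-validation: each fold is held out once while the rest trains the model.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    make_pipeline(StandardScaler(), LinearRegression()),
    X_demo, y_demo, cv=cv, scoring="r2",
)
print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean())
```

The spread of the per-fold scores gives a rough sense of how much the reported R-squared might vary with a different split.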


### Displaying Actual vs Predicted Values

In [15]:
# Optionally, display actual vs predicted values
print("\nFirst few actual vs predicted values:")
print("Actual values:", y_test.head())
print("Predicted values:", y_pred[:5])



First few actual vs predicted values:
Actual values: 4013     44.0
6119     41.0
12585    72.0
4395     87.0
9070     90.0
Name: audience_rating, dtype: float64
Predicted values: [48.92638435 54.27892713 66.99295862 77.85063655 80.99386882]
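
Placing the five actual and predicted values printed above side by side makes the errors easier to read. This is a small sketch that copies those five rows by hand; the `comparison` frame and its column names are illustrative.

```python
import pandas as pd

# The five rows printed above, copied by hand (index = original row labels).
comparison = pd.DataFrame(
    {
        "actual": [44.0, 41.0, 72.0, 87.0, 90.0],
        "predicted": [48.92638435, 54.27892713, 66.99295862,
                      77.85063655, 80.99386882],
    },
    index=[4013, 6119, 12585, 4395, 9070],
)

# Signed error: positive means the model over-predicted.
comparison["error"] = comparison["predicted"] - comparison["actual"]

# Mean absolute error over these five rows only.
mae = comparison["error"].abs().mean()

print(comparison.round(2))
print("Mean absolute error on these rows:", round(mae, 2))
```

In these five rows the two lowest ratings are over-predicted and the three highest are under-predicted, which may hint that the linear model pulls predictions toward the middle of the range; five rows are too few to draw a firm conclusion.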
