# Video Game Sales Analysis Using Linear Regression

This notebook builds a **Linear Regression model** to predict the **global sales of a video game** from its platform, genre, publisher, and regional sales figures. The dataset is preprocessed by handling missing values and encoding categorical variables. Model performance is evaluated using **RMSE, MAE, and R²**. To limit overfitting, regularized **Ridge, Lasso, and ElasticNet** models are also fitted and compared with the plain linear model using 5-fold cross-validation. Each fitted model is then scored on a held-out test set.

## **Step 1 : Import Libraries and Load Data**

* **pandas**: Used for data manipulation and analysis, offering powerful data structures like DataFrames to handle and process structured data efficiently.
* **numpy**: Provides support for numerical operations on large multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
* **matplotlib.pyplot**: Used for creating static, interactive, and animated visualizations in Python, enabling plotting of various graphs such as line charts, scatter plots, and histograms.
* **seaborn**: A data visualization library built on matplotlib, providing a high-level interface to create statistical graphics such as heatmaps, pair plots, and box plots.
* **sklearn.model_selection**:
  * **train_test_split**: Splits the dataset into training and testing sets so the model can be evaluated on data it was not trained on.
  * **cross_val_score**: Evaluates a model through cross-validation by splitting the data into multiple folds and returning one score per fold (R² by default for regressors).
* **sklearn.preprocessing**:
  * **StandardScaler**: Standardizes features by removing the mean and scaling to unit variance, ensuring all features contribute equally to the model.
  * **OneHotEncoder**: Converts categorical variables into a binary matrix (dummy variables), enabling models to interpret non-numerical features.
  * **LabelEncoder**: Converts categorical data into numerical values by assigning a unique integer to each category, making it usable by machine learning models.

* **sklearn.linear_model**:
  * **LinearRegression**: Builds a linear model by fitting a straight line to the data, predicting the target variable based on input features.
  * **Ridge**: A variation of linear regression that introduces L2 regularization, which reduces model complexity by penalizing large coefficients.
  * **Lasso**: Applies L1 regularization to linear regression, effectively performing feature selection by shrinking less important feature coefficients to zero.
  * **ElasticNet**: Combines Lasso (L1) and Ridge (L2) regularization to reduce overfitting. It is useful with high-dimensional data or when predictors are highly correlated.

* **sklearn.metrics**:
  * **mean_squared_error (MSE)**: Measures the average squared difference between actual and predicted values, penalizing large errors.
  * **mean_absolute_error (MAE)**: Computes the average absolute difference between actual and predicted values, giving equal weight to all errors.
  * **r2_score (R²)**: Indicates how well the model fits the data, representing the proportion of variance explained by the model.
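
The three metrics above can be computed side by side on a few made-up numbers. This is a minimal sketch: the values in `y_true` and `y_pred` are toy values chosen for illustration, not results from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only (not taken from the dataset).
y_true = np.array([3.0, 0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # mean of |error|
mse = mean_squared_error(y_true, y_pred)    # mean of squared error
rmse = np.sqrt(mse)                         # back in the units of the target
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Because MSE squares each error, the single error of 1.0 contributes more to MSE than the two errors of 0.5 combined, while MAE weights them in proportion to their size.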


In [368]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import warnings 
warnings.filterwarnings('ignore')

In [None]:
df= pd.read_excel(r"C:\Users\Sanal Rumao\Downloads\vgsales (1).xlsx")

In [None]:
df

In [None]:
df['Publisher'].nunique()

In [None]:
df.head(10)

In [None]:
df.tail()

## **Step 2 : Exploratory Data Analysis**

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isnull().any()

In [None]:
df.isnull().sum()

In [None]:
sns.heatmap(df.isnull(),cmap='Spectral')

## **Step 3 : Data Preprocessing**

#### Drop rows with missing values

In [None]:
df = df.dropna()
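
Called without arguments, `dropna()` removes every row that contains at least one missing value; it does not remove columns. The sketch below contrasts that with dropping sparse columns first. The toy frame, the invented `Notes` column, and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; 'Notes' is an invented column that is mostly missing.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2006, np.nan, 2009, 2010],
    "Publisher": ["X", "Y", None, "Z"],
    "Notes": [np.nan, np.nan, np.nan, "x"],
})

# dropna() with default arguments removes every ROW containing a missing value.
rows_only = toy.dropna()

# Alternative: first drop COLUMNS whose missing fraction exceeds a threshold
# (0.5 is an arbitrary choice), then drop the remaining incomplete rows.
missing_fraction = toy.isnull().mean()
cols_filtered = toy.loc[:, missing_fraction <= 0.5]
cols_then_rows = cols_filtered.dropna()

print(rows_only.shape, cols_then_rows.shape)
```

When one column is responsible for most of the gaps, removing it first keeps more rows. Whether that matters here depends on the per-column counts reported by `df.isnull().sum()`.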

In [None]:
df.isnull().sum()

In [None]:
df.count()

In [None]:
df.shape

In [None]:
# Sum of sales per region  
sales_regions = df[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()  

plt.figure(figsize=(8, 8))  
plt.pie(sales_regions, labels=sales_regions.index, autopct='%1.1f%%', colors=['blue', 'red', 'green', 'purple'], startangle=140)  
plt.title("Sales Distribution by Region")  
plt.show()  

In [None]:
df.columns

In [None]:
df["Platform"].value_counts()

### Bar Chart - Platform-wise Sales

In [None]:
platform_sales = df.groupby('Platform')['Global_Sales'].sum().sort_values(ascending=False)[:10]  # Top 10 platforms
plt.figure(figsize=(12, 6))
sns.barplot(x=platform_sales.values, y=platform_sales.index, palette="coolwarm")
plt.title("Top 10 Platforms by Global Sales")
plt.xlabel("Total Sales (millions)")
plt.ylabel("Platform")
plt.show()

### Sales Distribution by Genre – Pie Chart

In [None]:
genre_sales = df.groupby('Genre')['Global_Sales'].sum().sort_values(ascending=False)  

plt.figure(figsize=(8, 8))  
plt.pie(genre_sales, labels=genre_sales.index, autopct='%1.1f%%', colors=sns.color_palette("coolwarm", len(genre_sales)), startangle=140)  
plt.title("Sales Distribution by Genre")  
plt.show()

### Sales Distribution by Publisher (Top 10 Publishers) – Bar Chart

In [None]:
publisher_sales = df.groupby('Publisher')['Global_Sales'].sum().sort_values(ascending=False)[:10]  

plt.figure(figsize=(8, 4))  
sns.barplot(x=publisher_sales.values, y=publisher_sales.index, palette="magma")  
plt.title("Top 10 Publishers by Global Sales")  
plt.xlabel("Total Sales (Millions)")  
plt.ylabel("Publisher")  
plt.show()  

In [None]:
categorical_cols = ['Platform', 'Genre', 'Publisher']


In [None]:
print(df[categorical_cols].dtypes)

**Label-encode the categorical columns**

In [None]:
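# Encode each categorical column as integer codes.
# fit_transform refits the encoder on every call, so one instance can be reused.
label_encoder = LabelEncoder()
categorical_columns = ['Platform', 'Genre', 'Publisher']
for col in categorical_columns:
    df[col] = df[col].astype(str)
    df[col] = label_encoder.fit_transform(df[col])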

In [None]:
type(categorical_cols)

### Drop columns that are not needed

In [None]:
df=df.drop(columns =['Rank','Name','Year']) 

## **Define features (X) and target (y)**

In [None]:
x = df.drop(columns =['Global_Sales'])
y = df['Global_Sales']
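
With this definition, `x` still contains `NA_Sales`, `EU_Sales`, `JP_Sales`, and `Other_Sales`. The column names suggest that `Global_Sales` may simply be their total; that is an assumption, and it is worth verifying on the real data. The sketch below builds synthetic data where the relationship holds exactly and shows the kind of check that could be run.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: the target is constructed as the exact sum of the regional columns.
rng = np.random.default_rng(0)
regional_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
toy = pd.DataFrame(rng.uniform(0, 5, size=(50, 4)), columns=regional_cols)
toy["Global_Sales"] = toy[regional_cols].sum(axis=1)

# Largest gap between the regional sum and the target; a value near zero means
# the target can be recovered from the features by simple addition.
max_gap = (toy[regional_cols].sum(axis=1) - toy["Global_Sales"]).abs().max()

model = LinearRegression().fit(toy[regional_cols], toy["Global_Sales"])
r2 = model.score(toy[regional_cols], toy["Global_Sales"])

print(max_gap, model.coef_.round(3), round(r2, 6))
```

If the same check on `df` returns a gap close to zero, a near-perfect R² mostly reflects that identity and not predictive skill. Training on `Platform`, `Genre`, and `Publisher` alone would then give a more informative result.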

In [None]:
x

## **Step 4: Model Building**

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,train_size=0.8,random_state=0)

**Training the model**

In [None]:
LR = LinearRegression()

In [None]:
LR

In [None]:
LR.fit(x_train, y_train)

In [None]:
y_pred=LR.predict(x_test)

In [None]:
mae = mean_absolute_error(y_test,y_pred)
mae

In [None]:
mse= mean_squared_error(y_test,y_pred)
mse

In [None]:
rmse = np.sqrt(mean_squared_error(y_test,y_pred))
rmse

In [None]:
r2=r2_score(y_test,y_pred)
r2

## **Step 5: Hyperparameter Tuning**

In [None]:
lr_model = LinearRegression()
lr_score = cross_val_score(lr_model, x_train, y_train, cv=5)  # one R² value per fold
lr_score.mean()
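
`cross_val_score` returns an array with one score per fold, and for regressors the default score is R². A minimal sketch of summarising those scores and requesting RMSE instead is given below. It uses synthetic data from `make_regression` as a stand-in for `x_train` and `y_train`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train / y_train.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = LinearRegression()
r2_scores = cross_val_score(model, X, y, cv=5)  # default scorer for regressors: R²
neg_rmse = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
rmse_scores = -neg_rmse  # scikit-learn negates errors so that higher is better

print(f"R2:   {r2_scores.mean():.3f} +/- {r2_scores.std():.3f}")
print(f"RMSE: {rmse_scores.mean():.3f} +/- {rmse_scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how much the score changes from fold to fold.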

In [None]:
ridge_model = Ridge(alpha=1.0)
ridge_score = cross_val_score(ridge_model, x_train, y_train, cv=5)  # one R² value per fold
ridge_score.mean()

In [None]:

lasso_model = Lasso(alpha=1.0)
lasso_score = cross_val_score(lasso_model, x_train, y_train, cv=5)  # one R² value per fold
lasso_score.mean()
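
The cells in this step cross-validate each model at a fixed `alpha=1.0`. Searching over several `alpha` values could look like the sketch below, which wraps `Ridge` in a pipeline with `StandardScaler` so the penalty treats all features on the same scale. The grid of values and the synthetic data are assumptions; `GridSearchCV` and `Pipeline` are not used elsewhere in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_train.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Scaling inside the pipeline means the scaler is refit on each training fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Illustrative grid; the range is an arbitrary choice.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(pipeline, {"ridge__alpha": alphas}, cv=5, scoring="r2")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to `Lasso` and `ElasticNet` by swapping the final step and the parameter name. For `ElasticNet`, `l1_ratio` can be added to the grid as well.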

## **Step 6 : Model Evaluation**

In [None]:
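lr_model.fit(x_train, y_train)
lr_predictions = lr_model.predict(x_test)
lr_mse = mean_squared_error(y_test, lr_predictions)
lr_r2 = r2_score(y_test, lr_predictions)
# Only the last expression of a cell is displayed, so print both values explicitly.
print(f"MSE: {lr_mse:.4f}")
print(f"R² (%): {lr_r2 * 100:.2f}")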

In [None]:
ridge_model.fit(x_train, y_train)
ridge_predictions = ridge_model.predict(x_test)
ridge_r2 = r2_score(y_test, ridge_predictions)
ridge_r2*100

In [None]:
lasso_model.fit(x_train, y_train)
lasso_predictions = lasso_model.predict(x_test)
lasso_r2 = r2_score(y_test, lasso_predictions)
lasso_r2*100

In [None]:
ElasticNet_model = ElasticNet(alpha=1.0)
ElasticNet_score = cross_val_score(ElasticNet_model, x_train, y_train, cv=5)  # one R² value per fold
ElasticNet_score.mean()


In [None]:
ElasticNet_model.fit(x_train, y_train)
ElasticNet_predictions = ElasticNet_model.predict(x_test)
ElasticNet_r2 = r2_score(y_test, ElasticNet_predictions)
ElasticNet_r2*100
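
The four test-set R² values are easier to compare in a single table. The sketch below fits the same four model types on synthetic data and collects the scores in a pandas Series. The data and the resulting numbers are illustrative only and say nothing about the video game dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's x and y.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0),
}

# Fit each model on the training split and score it on the test split.
rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    rows[name] = r2_score(y_test, model.predict(X_test)) * 100

results = pd.Series(rows, name="test_R2_percent").sort_values(ascending=False)
print(results.round(2))
```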