
**This dataset contains a list of video games with sales greater than 100,000 copies: about 16.6k rows and 11 columns. It was generated by a scrape of vgchartz.com.**

Breakdown of this notebook: first load the data and import the libraries, then work through the three stages below.

**1. Data Cleaning:**

    Dropping unnecessary columns.
    Dropping duplicates.
    Checking the missing-value percentage for each column.
    Dropping rows that have at least one null value.
    Renaming columns.
    Feature transformation.

**2. Data Visualization**

    using bar graph
    using pie chart
    using histogram

**3. Regression Analysis**  
   
    Linear Regression.
    Random Forest Regression.



### **Finding the path of the dataset**

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### **Importing required libraries and loading data**

In [None]:
# important Python libraries for machine learning
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # visualizing data
import seaborn as sns # statistical plots built on matplotlib
import sklearn # machine learning algorithms
import warnings
# silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# load dataset from input directory
df = pd.read_csv("../input/videogamesales/vgsales.csv") 
df.head()

In [None]:
df.describe()

### The describe() method reports a lower count for <b>Year</b> than for the other numeric columns, which indicates missing values in that column. The other statistics, such as the min and max of each column, are also useful.
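
A minimal sketch of why a lower count signals missing values. The toy frame and its values are made up for illustration and are not taken from vgsales.csv:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing Year
toy = pd.DataFrame({
    "Year": [2006.0, np.nan, 2008.0, 2009.0],
    "Global_Sales": [82.74, 40.24, 35.82, 33.00],
})

summary = toy.describe()
# describe() counts only non-null entries, so Year falls short of the row count
missing_year = len(toy) - int(summary.loc["count", "Year"])
print(summary.loc["count"])
print("Rows with missing Year:", missing_year)
```

Comparing each column's count against `len(df)` gives the number of missing entries without a separate `isnull()` pass.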

### **Removing duplicate and missing-value rows**

In [None]:
# dataset shape before cleaning
print("Dataset shape before cleaning:", df.shape)
# drop rows with at least one missing value (returns a new frame, so df stays intact)
data = df.dropna(how = "any")
print("Dataset shape after dropping rows with null values:", data.shape)
# drop duplicate rows
data = data.drop_duplicates()
print("Dataset shape after dropping duplicates:", data.shape)


### **Rename columns and check missing value percentage**


In [None]:
# change column names to lower case
data.columns = data.columns.str.lower()
# rename the regional sales columns to more descriptive names
data.rename(columns = {"na_sales":"north_usa_sales",
                        "eu_sales":"europe_sales",
                        "jp_sales":"japan_sales"},inplace = True)

# percentage of missing values per column (expected to be 0 after the dropna above)
m_perc = (data.isnull().mean() * 100).to_frame("missing percent")
print(m_perc)
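
The heading asks for a percentage per column. One way to compute it, sketched here on a made-up toy frame (the values are illustrative only), is to take the mean of the null mask and scale it by 100:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    "year": [2006.0, np.nan, 2008.0, np.nan],
    "publisher": ["Nintendo", "Nintendo", None, "Nintendo"],
    "global_sales": [82.74, 40.24, 35.82, 33.00],
})

# isnull().mean() is the fraction of nulls per column; times 100 gives a percentage
missing_percent = (toy.isnull().mean() * 100).to_frame("missing percent")
print(missing_percent)
```

Running this check before dropping rows shows how much data each column would cost; running it afterwards should report zero everywhere.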


### **Feature Transformation**

In [None]:
data.info()

In [None]:
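# feature transformation: year is stored as float, convert it to int
# (assign to the lower-case column; data.Year would only set an attribute)
data["year"] = data["year"].astype(int)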

### **Exploring Data for different questions**  
#### **To find game with highest global sales**

In [None]:
max_sold = data.global_sales.max()
hgs_game = data[data.global_sales == max_sold]
hgs_game[["name","global_sales","year"]]

#### **To find game with lowest global sales**

In [None]:
min_sold = data.global_sales.min()
lgs_game = data[data.global_sales == min_sold]
lgs_game[["name","global_sales","year"]]

### **To visualize total global game sales year-wise**


In [None]:
plt.rcParams['figure.figsize'] = (15,10)
# total (summed) global sales per release year
year_wise_game_sales  = pd.pivot_table(data ,index = "year" ,
                                       values = "global_sales",
                                       aggfunc = "sum")
# recent seaborn versions require x and y as keyword arguments
sns.barplot(x = year_wise_game_sales["global_sales"],y = year_wise_game_sales.index,orient = "h")
plt.title("Year-wise global game sales")
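
As a sketch of what the pivot table computes, the toy frame below (made-up values) shows that `pd.pivot_table` with a sum aggregation matches a plain `groupby(...).sum()` on the same column:

```python
import pandas as pd

# Toy frame (hypothetical values): several games per year
toy = pd.DataFrame({
    "year": [2006, 2006, 2008, 2008, 2008],
    "global_sales": [1.0, 2.5, 0.5, 0.25, 0.25],
})

# one row per year, holding the summed global_sales
by_pivot = pd.pivot_table(toy, index="year", values="global_sales", aggfunc="sum")
# the same aggregation written as a groupby
by_group = toy.groupby("year")["global_sales"].sum()
print(by_pivot)
```

Either form works for the plots in this notebook; the pivot table returns a DataFrame, while the groupby returns a Series.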

### **Top 10 platforms by global sales**

In [None]:
plt.rcParams['figure.figsize'] = (8,6)
# total global sales per platform
platform_wise_game_sales  = pd.pivot_table(data ,index = "platform",
                                           values = "global_sales",
                                           aggfunc = "sum")
# keep the ten platforms with the highest totals
platform_wise_game_sales  = platform_wise_game_sales.sort_values(
                    by = "global_sales",ascending  = False).head(10)

sns.barplot(x = platform_wise_game_sales["global_sales"],y = platform_wise_game_sales.index,orient = "h")
plt.title("Top 10 platforms by global game sales")

### **Listing top genres by global sales**

In [None]:
plt.rcParams['figure.figsize'] = (8,6)
# total global sales per genre
genre_wise_game_sales  = pd.pivot_table(data ,index = "genre",
                                        values = "global_sales",
                                        aggfunc = "sum")
genre_wise_game_sales  = genre_wise_game_sales.sort_values(
    by = "global_sales",ascending  = False).head(10)
sns.barplot(x = genre_wise_game_sales["global_sales"],y = genre_wise_game_sales.index,orient = "h",palette = "husl")
plt.title("Top 10 genres by global game sales")

### **Listing top publishers by global game sales**

In [None]:
plt.rcParams['figure.figsize'] = (8,6)
# total global sales per publisher, top ten only
publisher_wise_game_sales  = pd.pivot_table(data ,index = "publisher" ,values = "global_sales",aggfunc = "sum")
publisher_wise_game_sales  = publisher_wise_game_sales.sort_values(by = "global_sales",ascending  = False).head(10)
sns.barplot(x = publisher_wise_game_sales["global_sales"],y = publisher_wise_game_sales.index,orient = "h",palette = "viridis")
plt.title("Top 10 publishers by global game sales")

### **Let's find the top 5 best-selling games in the Action genre**

In [None]:
top_five_action_games = data[data.genre == "Action"][["name","global_sales"]]
top_five_action_games = top_five_action_games.sort_values(by = "global_sales",ascending = False )
# keep each title once (a game can appear on several platforms), then take the top five
top_five_action_games = top_five_action_games.drop_duplicates(["name"]).head(5)
sns.barplot(x = top_five_action_games["global_sales"],y = top_five_action_games["name"])
plt.title("Top five Action games and their worldwide sales")
top_five_action_games

### **Which is the highest-selling game in North America (north_usa_sales)?**

In [None]:
north_usa_highest_sold_game = data.north_usa_sales.max()
print("Highest-selling game in North America:")
data[data["north_usa_sales"] == north_usa_highest_sold_game][["name","north_usa_sales"]]

### **Total sales year-wise**

In [None]:
tot_sales_year_wise = pd.pivot_table(data,index = "year",values = "global_sales",aggfunc= "sum")
#print(tot_sales_year_wise) # uncomment this to see the total sales for each year
plt.plot(tot_sales_year_wise.index,tot_sales_year_wise["global_sales"],color = 'g',marker = "*")
plt.title("Total sales year-wise")
plt.xlabel("Year")
plt.ylabel("total global_sales")

### **Re-prepare data for model**

In [None]:
# work on a copy so the cleaned data is left unchanged
data2 = data.copy()

In [None]:
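def data_encode(x_data):
    # replace the values of every column with integer codes (one code per distinct value)
    for i in x_data.columns:
        x_data[i]=x_data[i].factorize()[0]
        
    return x_data    
    
# features: every column except the target; target: global_sales
x_data = data2.drop("global_sales",axis = 1)
y_data = data2["global_sales"]
x_data = data_encode(x_data)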
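
To illustrate what the encoding step does, here is a small sketch of `factorize` on a made-up column (the platform names are only examples). Each distinct value gets an integer code in order of first appearance:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "platform": ["Wii", "NES", "Wii", "GB"],
    "genre": ["Sports", "Platform", "Racing", "Role-Playing"],
})

# codes: one integer per row; uniques: the distinct values, indexed by code
codes, uniques = pd.factorize(toy["platform"])
print(codes)
print(list(uniques))
```

The codes carry no meaningful order, so a linear model will still treat them as numeric magnitudes; that is worth keeping in mind when reading the scores below.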


### **Model Building**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score,mean_squared_error

In [None]:
from sklearn.model_selection import train_test_split
# hold out 30% of the rows for testing; the fixed seed makes the split reproducible
xtrain,xtest,ytrain,ytest=train_test_split(x_data,y_data,test_size=.3,random_state=1)


### **Linear Regression**

In [None]:
lr_model = LinearRegression()
lr_model.fit(xtrain,ytrain)
ypred = lr_model.predict(xtest)
n = len(xtest)      # number of test samples
p = xtest.shape[1]  # number of features
r2_value = r2_score(ytest,ypred)
# adjusted R^2 penalizes R^2 for the number of features used
adjusted_r2_score = 1 - (((1-r2_value)*(n-1)) /(n-p-1))
print("r2_score for Linear Reg model : ",r2_value)
print("adjusted_r2_score Value       : ",adjusted_r2_score)
print("MSE for Linear Regression     : ",mean_squared_error(ytest,ypred))
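
The adjusted R² expression used above can be wrapped in a small helper. This is a sketch only: the function name `adjusted_r2` and the example numbers are hypothetical and are not results from this dataset.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Illustrative numbers: R^2 of 0.80 on 101 samples with 10 features
example = adjusted_r2(0.80, n_samples=101, n_features=10)
print(round(example, 4))
```

With many test rows and few features the correction is small, so the adjusted value stays close to the plain R².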

### **Random Forest Regressor**

In [None]:
# 200 trees; a node needs at least 20 samples before it can be split
rf_model = RandomForestRegressor(n_estimators=200,min_samples_split=20,random_state=43)
rf_model.fit(xtrain,ytrain)
ypred = rf_model.predict(xtest)
n = len(xtest)      # number of test samples
p = xtest.shape[1]  # number of features
r2_value = r2_score(ytest,ypred)
adjusted_r2_score = 1 - (((1-r2_value)*(n-1)) /(n-p-1))
print("r2_score for Random Forest Reg model : ",r2_value)
print("adjusted_r2_score Value              : ",adjusted_r2_score)
print("MSE for Random Forest Regression     : ",mean_squared_error(ytest,ypred))