# An Exploratory Analysis of Steam Game Statistics, Focusing on Game Ratings and Hours Played on Steam

## Research Question or Motivation For this Project/Analysis

### To determine the relation, if any, between the rating of a game and its market success (measured as the hours the game was played)

### The Data Sets Used in this Analysis :-

#### 1. [MetaCritic Rating DataSet On Kaggle for Games On Steam](https://www.kaggle.com/skateddu/metacritic-games-stats-20112019)

####     -> This data set contains ratings and user feedback for games on Steam, which are used to calculate the Normalized Rating Score for each game in this analysis.

#### 2. [Steam Game Played by Hours DataSet On Kaggle](https://www.kaggle.com/tamber/steam-video-games)

####     -> This data set lists the games played on Steam along with the number of hours each game was played by different users. This information is used to calculate the Normalized Play Score for each game in this analysis.

$-----------------------------------------------------------$

## Importing The Required Libraries and Magic


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as col
import matplotlib.cm as cm
import seaborn as sns

%matplotlib inline

In [None]:
plt.style.use('seaborn-white')

## Loading the Data Sets

#### [Steam Game Played by Hours DataSet On Kaggle](https://www.kaggle.com/tamber/steam-video-games)

####     -> This data set lists the games played on Steam along with the number of hours each game was played by different users. This information is used to calculate the Normalized Play Score for each game in this analysis.

In [None]:
# Data Set for gaming hours of steam

st = pd.read_csv('../input/steam-video-games/steam-200k.csv',names=['User_ID','Game','Action','Value','Other'])
st.head(10)

#### [MetaCritic Rating DataSet On Kaggle for Games On Steam](https://www.kaggle.com/skateddu/metacritic-games-stats-20112019)

####     -> This data set contains ratings and user feedback for games on Steam, which are used to calculate the Normalized Rating Score for each game in this analysis.

In [None]:
# Data Set for Metacritic Ratings

rt = pd.read_csv('../input/metacritic-games-stats-20112019/metacritic_games.csv')
rt.head(10)

## Data Cleaning


In [None]:
# Drop the user IDs, as they are of no interest in this EDA

st = st[['Game','Action','Value']]

# Keep only the play-hours rows and drop the now-obsolete Action column
# (drop/rename return new frames, which avoids modifying a filtered slice in place)

st = st[st['Action'] == 'play'].drop(columns=['Action'])
st = st.rename(columns={'Value':'Hours Played'})

#Taking Cumulative Playing time of all the users

st = st.groupby('Game').sum().reset_index()
st = st.sort_values('Hours Played',ascending = False).reset_index(drop=True)

#top 10 Most Played Games

st.head(10)

In [None]:
#Keeping only relevant columns

# Named `cols` so the `matplotlib.colors as col` import is not overwritten
cols = rt.columns
rt = rt[cols[:2].tolist()+cols[7:-2].tolist()]

#Taking only PC games
rt = rt[rt['platform'] == 'PC'].drop(columns=['platform'])

rt.head()

## Generating a rating index by using the overall ratings of both the critics and users

### Assigning scores to all the ratings

- Positive Critic = + 1
- Neutral Critic = + 0.5
- Negative Critic = - 1
- Positive User = + 1
- Neutral User = + 0.5
- Negative User = - 1

### Calculating the final Score

In [None]:
#Score

# Weighted sum of the review counts: positive = +1, neutral = +0.5, negative = -1
rt['Score'] = (
    rt['positive_critics'] + 0.5*rt['neutral_critics'] - rt['negative_critics']
    + rt['positive_users'] + 0.5*rt['neutral_users'] - rt['negative_users']
)
rt = rt[['game','Score']].rename(columns={'game':'Game'})
rt = rt.sort_values('Score',ascending=False).reset_index(drop=True)

#Top 10 Rated Games
rt.head(10)

### Conclusion 1. The set of most played games and the set of top rated games turn out to be disjoint. As a result, the EDA from this point on is divided into two sections:
  1.  Comparing the play time of the most played games with their respective ratings.
  2.  Comparing the ratings of the top rated games with their respective play time.
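
The disjointness claim can be checked directly instead of by comparing the two tables by eye. The following is a minimal sketch on made-up toy frames: `top_overlap` is a hypothetical helper, and the game names and numbers are invented for illustration rather than taken from the data sets.

```python
import pandas as pd

def top_overlap(played, rated, n=10):
    """Return the games that appear in both top-n lists (hypothetical helper)."""
    top_played = set(played.nlargest(n, 'Hours Played')['Game'])
    top_rated = set(rated.nlargest(n, 'Score')['Game'])
    return top_played & top_rated

# Toy data, invented purely for illustration
played = pd.DataFrame({'Game': ['A', 'B', 'C', 'D'],
                       'Hours Played': [900.0, 500.0, 20.0, 10.0]})
rated = pd.DataFrame({'Game': ['A', 'B', 'C', 'D'],
                      'Score': [5.0, 10.0, 80.0, 95.0]})

overlap = top_overlap(played, rated, n=2)
print(overlap if overlap else 'The two top-2 sets are disjoint')
```

On the real frames, an empty result for `top_overlap(st, rt)` would confirm the conclusion, assuming game titles are spelled identically in both data sets.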


In [None]:
# Building the merged data set

final = pd.merge(st,rt,how='inner',left_on='Game',right_on='Game')
# Note: the [2:12] slice skips the first two rows and keeps the next ten
topst = final.sort_values('Hours Played',ascending=False)[2:12]
topst

In [None]:
# Note: the [2:12] slice skips the first two rows and keeps the next ten
toprt = final.sort_values('Score',ascending=False)[2:12]
toprt

### To have the rating and the play time on the same scale, both values are normalized to the range [0,2]
### Normalizing the play time to the range [0,2]:
$$\text{Normalized Hours} = \frac{\text{Hours} - \min(\text{Hours})}{\max(\text{Hours}) - \min(\text{Hours})} \times 2$$
### Normalizing the final score to the range [0,2]:
$$\text{Normalized Score} = \frac{\text{Score} - \min(\text{Score})}{\max(\text{Score}) - \min(\text{Score})} \times 2$$

In [None]:
#Function to normalize the values

def Normalize(lst):
    norm = []
    mx = max(lst)
    mn = min(lst)
    for i in lst:
        norm.append( ((i - mn) / (mx - mn)) * 2 )
    return norm
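
The same min-max scaling can also be written in vectorized form on a pandas Series. This is a sketch, not the function the notebook uses: `normalize_series` is a hypothetical name, the input values are made up, and the zero-span guard is an added assumption about how to handle a column whose values are all equal.

```python
import pandas as pd

def normalize_series(values, upper=2.0):
    """Min-max scale a pandas Series to [0, upper] (hypothetical helper)."""
    span = values.max() - values.min()
    if span == 0:
        # All values identical: the formula would divide by zero, so return zeros
        return pd.Series(0.0, index=values.index)
    return (values - values.min()) / span * upper

hours = pd.Series([10.0, 20.0, 30.0])  # made-up hours
scaled = normalize_series(hours)
print(scaled.tolist())  # [0.0, 1.0, 2.0]
```

The smallest value maps to 0, the largest to 2, and everything else falls in between in proportion to its distance from the minimum.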

In [None]:
# Normalizing the Scores

topst['Play Score'] = Normalize(topst['Hours Played'].tolist())
topst['Rating Score'] = Normalize(topst['Score'].tolist())
topst.reset_index(drop=True,inplace=True)
topst

In [None]:
# Normalizing the Scores

toprt['Play Score'] = Normalize(toprt['Hours Played'].tolist())
toprt['Rating Score'] = Normalize(toprt['Score'].tolist())
toprt.reset_index(drop=True,inplace=True)
toprt

## Data Visualization

In [None]:
#testing for trends in most played games

fig1 = plt.figure(figsize=(8,6.5))
plt.plot(topst['Play Score'],'-o',label='Play Score',c='orange')
plt.plot(topst['Rating Score'],'-o',label='Rating Score',c='b')
plt.legend(title='Legend');
spines1 = plt.gca().spines
spines1['right'].set_visible(False)
spines1['top'].set_visible(False)
spines1['left'].set_visible(False)
spines1['bottom'].set_visible(False)
plt.grid()
plt.title('Play Time of Most Played games V/S Rating Score.');
plt.xlabel('Top Game Ranks');
plt.ylabel('Normalized Score Scale');

In [None]:
#testing for trends in top rated games

fig2 = plt.figure(figsize=(8.,6.5))
plt.plot(toprt['Play Score'],'-o',label='Play Score',c='orange')
plt.plot(toprt['Rating Score'],'-o',label='Rating Score',c='b')
spines2 = plt.gca().spines
spines2['right'].set_visible(False)
spines2['top'].set_visible(False)
spines2['left'].set_visible(False)
spines2['bottom'].set_visible(False)
plt.title('Play Time of Top Rated games V/S Rating Score.');
plt.xlabel('Top Game Ranks');
plt.ylabel('Normalized Score Scale');
plt.legend(title='Legend')
plt.grid()

# Final Visualization

In [None]:
canv,((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2)
canv.set_size_inches(18,15)
canv.tight_layout(pad=5.0)

#First Plot,(0,0)

plt.sca(ax1)
plt.rcParams.update({'font.size': 14})
bars = plt.barh(np.arange(len(topst['Game'])),topst['Hours Played'].iloc[::-1],color='lightslategrey',alpha=0.7)
bars[-1].set_color('orange')

for bar,name,value in zip(bars,topst['Game'].iloc[::-1].tolist(),topst['Hours Played'].iloc[::-1].tolist()):
    plt.text((bar.get_width()/4)-2500,(bar.get_y()+0.3),name + ' ({:.0f} Hours)'.format(value),color='w',fontweight='bold',fontsize=13)

plt.yticks(np.arange(len(topst['Game'])),np.array([10,9,8,7,6,5,4,3,2,1]));
ax1.set_xticks([])
plt.xlabel('Hours Played on Steam.',fontsize=15)
plt.ylabel('Ranking Based on Play Time on Steam.',fontsize=15)

for spine in plt.gca().spines.values():
    spine.set_visible(False)

#Second Plot,(0,1)

plt.sca(ax2)
plt.rcParams.update({'font.size': 14})
plt.plot(topst['Play Score'],'-o',label='Play Score',c='orange')
plt.plot(topst['Rating Score'],'-o',label='Rating Score',c='b')
plt.legend(title='Legend');

for spine in plt.gca().spines.values():
    spine.set_visible(False)

plt.grid(alpha=0.8)
plt.title('Play Time of Most Played games V/S Rating Score.');
plt.xticks(np.arange(len(topst['Game'])),np.array([1,2,3,4,5,6,7,8,9,10]));
plt.xlabel('Ranking Based on Play Time on Steam.',fontsize=15);
plt.ylabel('Normalized Score Scale.',fontsize=15);

#Third Plot,(1,0)

plt.sca(ax3)
plt.rcParams.update({'font.size': 14})
bars = plt.barh(np.arange(len(toprt['Game'])),toprt['Score'].iloc[::-1],color='lightslategrey',alpha=0.7)
bars[-1].set_color('b')

for bar,name,value in zip(bars,toprt['Game'].iloc[::-1].tolist(),toprt['Score'].iloc[::-1].tolist()):
    plt.text((bar.get_width()/4-10),(bar.get_y()+0.3),name + ' ({:.0f})'.format(value),color='w',fontweight='bold',fontsize=13)

for spine in plt.gca().spines.values():
    spine.set_visible(False)

plt.yticks(np.arange(len(toprt['Game'])),np.array([10,9,8,7,6,5,4,3,2,1]));
ax3.set_xticks([])
plt.xlabel('Rating Score on Steam.',fontsize=15)
plt.ylabel('Ranking Based on Rating Score on Steam.',fontsize=15)

#Fourth Plot(1,1)

plt.sca(ax4)
plt.rcParams.update({'font.size': 14})
plt.plot(toprt['Play Score'],'-o',label='Play Score',c='orange')
plt.plot(toprt['Rating Score'],'-o',label='Rating Score',c='b')

for spine in plt.gca().spines.values():
    spine.set_visible(False)

plt.title('Play Time of Top Rated games V/S Rating Score.');
plt.xlabel('Ranking Based on Rating Score on Steam.',fontsize=15);
plt.xticks(np.arange(len(toprt['Game'])),np.array([1,2,3,4,5,6,7,8,9,10]));
plt.ylabel('Normalized Score Scale.',fontsize=15);
plt.legend(title='Legend');
plt.grid(alpha=0.8)

## Corollary :-
### First, as we can see, there is next to no correlation between the rating and the play time of a game. Possible reasons for this are described below:

1. The rating system is skewed by genre: the rating of an FPS (first-person shooter) game cannot be compared directly with the rating of a non-FPS game.

2. A game's rating is established soon after it hits the market, whereas its play time accumulates gradually over the years. A newer game with a better rating therefore has little chance of having more play time than older games.

3. Competitive games demand rigorous play time and dedication, which inflates their hours and skews the analysis.

4. The difference between rating and repeated play is highlighted in the visualization by the fact that the set of top rated games and the set of most played games are disjoint.
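
The "next to no correlation" observation could be quantified with a correlation coefficient over all merged games rather than read off the top-10 plots. The sketch below is an assumption about how one might do that, not a step the notebook ran: `toy` is an invented stand-in for the merged frame `final`, and Spearman's coefficient is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy stand-in for the merged frame `final`; all values are invented
toy = pd.DataFrame({
    'Game': ['A', 'B', 'C', 'D', 'E'],
    'Hours Played': [1000.0, 400.0, 50.0, 30.0, 5.0],
    'Score': [12.0, 80.0, 15.0, 95.0, 40.0],
})

# Pearson: linear association between the raw values
pearson = toy['Hours Played'].corr(toy['Score'])

# Spearman: Pearson correlation of the ranks, less sensitive to the heavy skew in play time
spearman = toy['Hours Played'].rank().corr(toy['Score'].rank())

print(f'Pearson:  {pearson:.2f}')
print(f'Spearman: {spearman:.2f}')
```

Coefficients close to zero on the real data would support the conclusion above; the rank-based version is worth reporting alongside Pearson because a few games with very large play times can dominate the raw values.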

## Final Summary:
#### To further increase the accuracy of this analysis, clustering is a good option. As of now, rating and play hours show little or no correlation, and hence rating is not a good indicator of a game's market success.