![logo](./images/OPTIMISE.%20Logo%20(green).png)

In this project I role-play as a BI consultant from Optimise., with Steam as my client. I provide a full sales analysis and a sales prediction model based on key game features.

This is Notebook 4 out of 4 of this project.

# Optimise.
BUSINESS INTELLIGENCE SOLUTIONS

Optimise. uses data analysis to give businesses a clear view of their current operations, and provides actionable advice grounded in meticulous analysis that produces tangible results.

The analysis focuses on these main areas:     
- Product Analysis
    - Performance
    - Classification
    - Pricing
- Customer Analysis
    - Customer Profile
    - Customer Trends
    - Customer Lifetime Value
- Sales Analysis
    - Date/Time Overview
    - Discount Efficiency
    - Projections
    
The deliverables are a comprehensive report with useful visualizations and specific recommendations based on the results of the analysis.

# Final Working Model
This notebook collects the final version of the logistic regression model created for this project. It predicts the number of sales of a game (the `owners` range) based on its genre, category, platform, publisher (top-tier or not), and developer (top-tier or not).

### Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as sp
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.utils import resample
from IPython.display import clear_output
from sklearn import metrics
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

In [2]:
s = pd.read_csv("data/steam_cols_clean.csv")

In [3]:
# grouping boolean columns by type

platforms = ['linux', 'windows', 'mac']
genres = ['Indie', 'Sports', 'Simulation', 'Strategy', 'Early Access', 'Casual',
       'RPG', 'Free to Play', 'Adventure', 'Action', 'Racing']
categories = ['Includes level editor', 'MMO', 'VR Support', 'Single-player',
       'Controller Support', 'Online', 'Multi-Player', 'co-op', 'Local']
tags = ['Nudity', 'Retro', 'Violent', 'Visual Novel', 'RPGMaker', 'Fighting',
       'FPS', 'Female Protagonist', 'Board Game', 'Space', 'World War II',
       'Platformer', 'Anime', 'Great Soundtrack', 'Massively Multiplayer',
       'Open World', 'Sexual Content', 'Arcade', 'Gore', 'Pixel Graphics',
       'Turn-Based', 'Music', 'Fantasy', 'Point & Click', 'Rogue-like',
       'World War I', "Shoot 'Em Up", 'RTS', 'Story Rich', 'Hidden Object',
       'Turn-Based Strategy', 'Survival', 'Match 3', 'Horror', 'Puzzle',
       'Sci-fi', 'Tower Defense', 'VR', 'Management', '2D', 'Card Game',
       'Multiplayer', 'Utilities', 'Shooter', 'War', 'Co-op', 'Zombies',
       'Classic', 'Singleplayer']

#### Developer & Publisher Adjustments

In [4]:
# Share (%) of developers with 10 or more games
dev_count = s["developer"].value_counts()
len(dev_count[dev_count>9])/len(dev_count)*100

1.016770875942266

In [5]:
# Flag games whose developer string contains a developer with 20 or more titles
s["top_developer"] = 0
values = dev_count[dev_count>19].index.tolist()

# Substring match on purpose; s.loc avoids chained assignment on a copy
for w in values:
    for i in range(len(s)):
        if w in s["developer"][i]:
            s.loc[i, "top_developer"] = 1

s["top_developer"].value_counts()

0    25806
1     1269
Name: top_developer, dtype: int64

In [6]:
# Share (%) of publishers with 17 or more games
pub_count = s["publisher"].value_counts()
len(pub_count[pub_count>16])/len(pub_count)*100

0.996237982443918

In [7]:
# Flag games whose publisher string contains a publisher with 50 or more titles
s["top_publisher"] = 0
values = pub_count[pub_count>49].index.tolist()

# Substring match on purpose; s.loc avoids chained assignment on a copy
for w in values:
    for i in range(len(s)):
        if w in s["publisher"][i]:
            s.loc[i, "top_publisher"] = 1

s["top_publisher"].value_counts()

0    24622
1     2453
Name: top_publisher, dtype: int64
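
The flagging in the two cells above can also be sketched without an explicit double loop. The snippet below is a minimal illustration on made-up data: `toy`, `flag_top` and the threshold of 2 are assumptions for the example, not part of this notebook. It keeps the same substring match (`w in name`), so a game credited to several developers is flagged if any frequent developer name appears in the string.

```python
import pandas as pd

def flag_top(names: pd.Series, min_count: int) -> pd.Series:
    """Return 1 where the string contains any name occurring at least min_count times."""
    counts = names.value_counts()
    frequent = counts[counts >= min_count].index.tolist()
    # Substring match, mirroring the `w in s["developer"][i]` check in the notebook
    return names.apply(lambda name: int(any(w in name for w in frequent)))

# Hypothetical toy data: "A" appears twice on its own and once inside "A;C"
toy = pd.DataFrame({"developer": ["A", "A", "B", "A;C", "C"]})
toy["top_developer"] = flag_top(toy["developer"], min_count=2)
print(toy)
```

Because the match is a substring test, a short developer name can also match inside a longer, unrelated name; an exact match on split names would be stricter.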

## Optional Data Adjustments
This section includes two optional ways of regrouping the target variable (`owners`).

1. Three groups instead of 13: this reduces the number of classes significantly while keeping the results meaningful.
2. Two groups instead of 13: this allows a binary model but makes the results considerably less informative.

The regrouping cells reassign `s["owners"]`, so run at most one of them on a freshly loaded dataframe. The resampled dataframes (`owners_upsampled` and `owners_upsampled_binary`) are separate objects and do not overwrite `s`; to use them in the model, either assign them to `s` or adjust the model function to read from them.
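
As a minimal sketch of the three-group idea, the regrouping can be written to a new column so the original `owners` labels stay available. The dataframe `toy`, the column name `owners_3` and the lookup `groups` are hypothetical names used only for this illustration; the range labels are the ones used in the cells of this section.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself works on the full dataframe `s`
toy = pd.DataFrame({"owners": ["0-20000", "50000-100000", "1000000-2000000", "0-20000"]})

mid = ["20000-50000", "50000-100000", "100000-200000", "200000-500000"]
high = ["500000-1000000", "1000000-2000000", "2000000-5000000", "5000000-10000000",
        "10000000-20000000", "20000000-50000000", "50000000-100000000", "100000000-200000000"]

# One lookup table; ranges not listed (such as "0-20000") keep their label
groups = {**{r: "20000-500000" for r in mid}, **{r: "500000+" for r in high}}
toy["owners_3"] = toy["owners"].map(lambda r: groups.get(r, r))

print(toy)
```

Keeping the grouped labels in a separate column means both groupings can be derived from the same dataframe without reloading the CSV.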

#### Three Target Groups

In [None]:
def reassign_owners(row):
    a = ["20000-50000", "50000-100000", "100000-200000", "200000-500000"]
    b = ["500000-1000000", "1000000-2000000", "2000000-5000000", "5000000-10000000", "10000000-20000000", 
         "20000000-50000000", "50000000-100000000", "100000000-200000000"]
    if row in a:
        row = "20000-500000"
    elif row in b:
        row = "500000+"
    return row

s["owners"] = s["owners"].apply(lambda x: reassign_owners(x))

s["owners"].value_counts()

In [None]:
# Deal with class imbalance
majority = s[s['owners'] == "0-20000"]
minority1 = s[s['owners'] == "20000-500000"]
minority2 = s[s['owners'] == "500000+"]

# Upsample the minority classes to the size of the majority class
minority1_upsampled = resample(minority1, n_samples=len(majority)) #random_state=123
minority2_upsampled = resample(minority2, n_samples=len(majority))

# Combine the upsampled minority classes with the majority class
owners_upsampled = pd.concat([minority1_upsampled, minority2_upsampled, majority])

owners_upsampled['owners'].value_counts()

#### Two Target Groups

In [None]:
def reassign_owners_binary(row):
    a = ["20000-50000", "50000-100000", "100000-200000", "200000-500000"]
    b = ["500000-1000000", "1000000-2000000", "2000000-5000000", "5000000-10000000", "10000000-20000000", 
         "20000000-50000000", "50000000-100000000", "100000000-200000000"]
    if row in a or row in b:
        row = "20000+"
    return row

s["owners"] = s["owners"].apply(lambda x: reassign_owners_binary(x))

s["owners"].value_counts()

In [None]:
# Deal with class imbalance
majority = s[s['owners'] == "0-20000"]
minority = s[s['owners'] == "20000+"]

# Upsample the minority class to the size of the majority class
minority_upsampled = resample(minority, n_samples=len(majority)) #random_state=123

# Combine the upsampled minority class with the majority class
owners_upsampled_binary = pd.concat([minority_upsampled, majority])

owners_upsampled_binary['owners'].value_counts()
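
The resampling step can be sketched on a small made-up dataframe. This is an illustration under assumptions: `toy`, `feature` and the class sizes are invented, and `random_state=123` is switched on here (it is only a commented-out hint in the cells above) so that the draw is repeatable.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 6 majority rows, 2 minority rows
toy = pd.DataFrame({"owners": ["0-20000"] * 6 + ["20000+"] * 2, "feature": range(8)})
majority = toy[toy["owners"] == "0-20000"]
minority = toy[toy["owners"] == "20000+"]

# Sample the minority class with replacement until it matches the majority size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=123)
balanced = pd.concat([majority, minority_up])

print(balanced["owners"].value_counts())
```

Since upsampling duplicates rows, splitting into train and test sets after this step can place copies of the same game in both sets; resampling only the training portion avoids that.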

## Logistic Regression Model Foundation


In [8]:
def logistic_regression_model_full(platform, genre, category, developer, publisher):
    # Keep only games matching the chosen genre, category and platform
    subset = s[(s[genre]==1) & (s[category]==1) & (s[platform]==1)]
    # Any answer other than "yes" is treated as "no"
    subset = subset[subset["top_developer"]==(1 if developer == "yes" else 0)]
    subset = subset[subset["top_publisher"]==(1 if publisher == "yes" else 0)]
    # Assign values x: drop identifiers, the target, and the platform/genre/category columns not selected
    unused = [c for c in platforms + genres + categories if c not in (platform, genre, category)]
    x = subset.drop(columns=["appid", "name", "release_date", "english", "developer", "publisher", "required_age",
                             "achievements", "positive_ratings", "negative_ratings", "average_playtime",
                             "median_playtime", "owners", "price"] + unused)
    # Assign values y
    y = subset["owners"]
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    # Generate the model
    model = LogisticRegression(max_iter=1000000).fit(X_train, y_train)
    # Get score of the model
    score = model.score(X_test,y_test)
    # Predict using test data
    y_pred = model.predict(X_test)
    # Generate matrix (computed but not displayed)
    matrix = confusion_matrix(y_test, y_pred)
    # Report the most frequently predicted class across the test set
    print("\nThe expected amounts of sales for this game fall in the range", pd.Series(y_pred).mode()[0])
    print("With an accuracy score of", round(score*100, 2), "%")
    # Data Visualization
    return #plt.plot(y_pred)
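
The split, fit and scoring steps in the function above can be tried in isolation. The following is a minimal sketch on synthetic data, not the notebook's results: the feature matrix `X`, the labels `y` and the rule that generates them are invented, and `stratify` and `random_state` are additions that the function above does not use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))             # synthetic binary "tag" columns
y = np.where(X[:, 0] == 1, "20000+", "0-20000")   # label depends only on the first column

# stratify keeps the class proportions equal in both splits; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=["0-20000", "20000+"])
print("accuracy:", round(acc * 100, 2), "%")
print(cm)  # rows are true classes, columns are predicted classes
```

Printing the confusion matrix next to the accuracy shows which classes are being confused, which a single accuracy figure hides, especially when the filtered subset is small.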

In [9]:
def model_full():
    print("This model will return the predicted number of sales based on key features like platform, genre, and categories.\n")
    #n = input("How many features do you want to test?")
    print("\nPlatforms:", platforms, "\n")
    platform = input("Choose Platform:\n")
    print("\nGenres:", genres, "\n")
    genre = input("Choose Genre:\n")
    print("\nCategories", categories, "\n")
    category = input("Choose Category:\n")
    developer = input("Has the developer developed 20 games or more?")
    publisher = input("Has the publisher published 50 games or more?")
    result = logistic_regression_model_full(platform, genre, category, developer, publisher)
    return print("\n", result)
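
The typed answers are used directly as column names, so a misspelled or differently cased genre fails with a `KeyError`. One possible guard is sketched below; `validate_choice` is a hypothetical helper, not part of this notebook, and the short `genres` list is a cut-down example.

```python
def validate_choice(value: str, options: list) -> str:
    """Return the option matching value case-insensitively, or raise ValueError."""
    lookup = {opt.lower(): opt for opt in options}
    key = value.strip().lower()
    if key not in lookup:
        raise ValueError(f"{value!r} is not one of {options}")
    return lookup[key]

# Cut-down example list; the notebook defines the full `genres` list earlier
genres = ["Indie", "Sports", "RPG"]
print(validate_choice(" rpg ", genres))
```

Each `input(...)` result could be passed through such a helper before calling the model function, so that the column lookup always receives the exact column name.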

# THE MODEL

In [13]:
model_full()

This model will return the predicted number of sales based on key features like platform, genre, and categories.


Platforms: ['linux', 'windows', 'mac'] 

Choose Platform:
windows

Genres: ['Indie', 'Sports', 'Simulation', 'Strategy', 'Early Access', 'Casual', 'RPG', 'Free to Play', 'Adventure', 'Action', 'Racing'] 

Choose Genre:
RPG

Categories ['Includes level editor', 'MMO', 'VR Support', 'Single-player', 'Controller Support', 'Online', 'Multi-Player', 'co-op', 'Local'] 

Choose Category:
Online
Has the developer developed 20 games or more?no
Has the publisher published 50 games or more?yes

The expected amounts of sales for this game fall in the range 35000
With an accuracy score of 0.0 %

 None
