A link to the presentation can be found Here
Data Source: https://www.kaggle.com/brunovr/metacritic-videogames-data
For the final project, the team decided to analyze the video game industry based on our shared interest in video games. Video games are continuously evolving within the tech industry; they have become part of popular culture and provide a fun, social form of entertainment, and their relevance to technology makes them applicable to our tech careers. The team sourced the video game dataset from Kaggle (linked above). Based on this dataset, our objective is to predict the most anticipated video game franchise based on sales and scores.
- Based on the number of Metacritic users and critics who reviewed a game, are we able to predict the score for an anticipated video game?
- Based on the review scores (critics and users) and franchise, are we able to predict video game sales and their projected growth?
- Do genre, review scores, platform, developer, franchise, and release dates have any impact on video game (VG) sales?
The team will analyze which game/franchise release is the most anticipated and by which developer, which game is likely to have the highest rating in the future, which game developer has the best scores, and which genre of games is most popular by number of players (single- vs. multiplayer).
Python, Pandas, Jupyter Notebook, SQL, SQLAlchemy, Tableau, Excel. For further information on what technology we'll be using and how, please see the technology.md file.
- Module 20 First Segment Project Deliverable due March 6th, 2022
- Module 20 Second Segment Project Deliverable due March 20th, 2022
- Module 20 Third Segment Project Deliverable due March 27th, 2022
- Module 20 Final Segment Project Deliverable due March 31st, 2022
Team Member Role Description (First Segment)
- Rasheem G. - X: The X role will focus on the technology side of the project. Uploaded the README on his branch for the project plan.
- Caitlin B. - Circle: The circle role will oversee the mockup database. Uploaded the mockup database as a .csv file on the main branch.
- Edin C. - Triangle: The triangle role is responsible for creating a simple machine learning model. Please see Edin's branch for the machine learning model.
- Tasnuva M. - Square: The square role will be responsible for setting up the repository. Pushed and uploaded the README and technology file to the main branch.
The roles above apply to the first segment of Module 20. As required, we plan to rotate the X role among team members. In the upcoming week, the team will determine the rotation of the X role to align with the deliverable deadlines.
Team Member Role Description (Second Segment)
- Rasheem G. - Square: The square role will refine the machine learning model (train and test). Uploaded the README on his branch for the project plan.
- Caitlin B. - Circle: The circle role will clean and analyze the data and create visuals to accompany the data story. Uploaded the mockup database as a .csv file on the main branch.
- Edin C. - Triangle: The triangle role is responsible for transforming the mockup database into a full database that integrates with the data. Please see Edin's branch for the machine learning model.
- Tasnuva M. - X: The X role will outline and conceptualize the dashboard and presentation, as well as update the repository. Pushed and uploaded the README and technology file to the main branch.
The roles above apply to the second segment of Module 20. As required, we plan to rotate the X role among team members. In the upcoming week, the team will determine the rotation of the X role to align with the deliverable deadlines.
*All members have 4 commits in their branches.
Sample dataset preparation for the machine learning model:
- Drop the players column.
- Clean and use the 2021 sales-per-game CSV data for additional info / the join requirement (Postgres).
- All potential duplicate values are an exact match, allowing easy filtering and reducing numerical/grammatical errors.
- "Release Date" column formatted mm-dd-yy.
- Format "Genre" column for grouping.
- Create a new column determining "Franchise".
- Verify "Critics" column usage.
- Linear or logistic regression model for determining developer/franchise success rate.
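The cleaning steps above can be sketched in pandas roughly as follows. The sample rows are invented for illustration; the column labels follow this README's column list:

```python
import pandas as pd

# A tiny stand-in for the Kaggle Metacritic export (column names follow this README).
games = pd.DataFrame({
    "name": ["Halo 3", "Halo 3", "Gran Turismo 5"],
    "platform": ["X360", "X360", "PS3"],
    "r-date": ["September 25, 2007", "September 25, 2007", "November 24, 2010"],
    "score": [94, 94, 84],
    "genre": ["Shooter, First-Person", "Shooter, First-Person", "Racing, Simulation"],
    "players": [None, None, "1-2"],
})

# Drop the players column: too much missing information to be useful.
games = games.drop(columns=["players"])

# Exact-match duplicates can be filtered out directly.
games = games.drop_duplicates().reset_index(drop=True)

# Standardize the release date to mm-dd-yy.
games["r-date"] = pd.to_datetime(games["r-date"]).dt.strftime("%m-%d-%y")

# Keep only the first listed genre so titles can be grouped consistently.
games["main_genre"] = games["genre"].str.split(",").str[0].str.strip()
```

On the real dataset the same steps would run after `pd.read_csv` on the Kaggle file.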
- name: the name of the game
- platform: the platform it was released on
- r-date: the date it was released
- score: average score given by critics (metascore)
- user score: average score given by users on the website
- developer: the game's developer
- genre: genre of the game (can be multiple)
- players: number of players (DROP THIS COLUMN due to missing information)
- critics: number of critics reviewing the game
- users: number of Metacritic users that reviewed the game
- Critic Scores/User Scores by Developer
- Critic Scores/User Scores by Platform
- Developer by Release Date
- Franchise Scores by Release Date
- Franchise by Critics
- Critics/Score by Genre
To prepare the data for our machine learning model, we must first import our data and create a DataFrame. Our team cleaned the data before importing it to better fit the model and produce more accurate results. Since the data is pulled from a public website (www.kaggle.com), we must standardize it to ensure the integrity of our analysis.
We found that the data was mainly in the wrong form and, with some work, could be standardized or repaired. For example, there were inconsistencies in the players column, and we decided to drop it since it would not have an impact on the outcome of our analysis.
We then left-joined another dataset that includes global sales to add another parameter for our model.
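That left join can be sketched with `DataFrame.merge`; the sales column name and the sample values here are illustrative assumptions, not the real file:

```python
import pandas as pd

# Review data keyed by game name (sample rows for illustration).
games = pd.DataFrame({
    "name": ["Portal 2", "L.A. Noire", "Obscure Indie"],
    "score": [95, 89, 70],
})

# Global-sales data; not every reviewed title will have a match.
sales = pd.DataFrame({
    "name": ["Portal 2", "L.A. Noire"],
    "global_sales": [4.2, 5.1],  # illustrative values, millions of units
})

# A left join keeps every reviewed game and leaves sales NaN where unmatched.
merged = games.merge(sales, on="name", how="left")
```

Keeping unmatched titles (rather than an inner join) preserves the full review dataset for the model.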
Feature engineering is an important part of the machine learning process, and taking the time to apply it improves the quality of our results. Features we extracted from the raw data include an additional franchise column, to show a game's success as a franchise rather than just per title, and a column for how long the game has been out, since a longer time on the market gives a title a bigger window to be purchased.
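A rough sketch of those two engineered features. The colon-based franchise split and the fixed analysis date are simplifying assumptions; a real franchise mapping would need a curated lookup table:

```python
import pandas as pd

games = pd.DataFrame({
    "name": ["Halo 3", "Halo: Reach", "Portal 2"],
    "r-date": ["2007-09-25", "2010-09-14", "2011-04-19"],
})

# Hypothetical franchise extraction: take the part of the title before a
# ":" separator (titles without one keep their full name).
games["franchise"] = games["name"].str.split(":").str[0].str.strip()

# Whole years the title has been on the market, relative to a fixed
# analysis date (assumed here for reproducibility).
analysis_date = pd.Timestamp("2022-03-01")
games["years_since"] = (analysis_date - pd.to_datetime(games["r-date"])).dt.days // 365
```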
During our decision making, we concluded that having our model answer the question of whether scores affect global sales would be difficult with scores varying from 60 to 100. Therefore, we decided to split the score column into tiers in increments of 10 to give our model a better opportunity at success. Our main features were platform, main genre, critics, developer, years since release, and franchise, because they had the highest correlation with tiers. Our data was split into the standard 80% training and 20% testing sets, which we believed gave our model enough data to be properly trained. In selecting a machine learning model, we concluded that a RandomForestClassifier was our best option, because RandomForestClassifiers handle low-correlation data very well. Many of our columns had low or no correlation, making this model ideal for our data.
One limitation of this model is that it can only take tabular data, which led our team to add the tier column derived from our scores. Our initial model was a linear regression, but upon further analysis we realized there just wasn't enough correlation for it to be used effectively. That led our team to switch to a RandomForestClassifier, which performed dramatically better. The model's current accuracy score is 91.8%.
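The tiering, 80/20 split, and RandomForestClassifier described above can be sketched roughly as follows, here on synthetic data with a reduced feature set (so the accuracy will not match the real 91.8%):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the cleaned dataset: scores range roughly 60-100.
df = pd.DataFrame({
    "critics": rng.integers(5, 80, 500),
    "years_since": rng.integers(0, 20, 500),
    "score": rng.integers(60, 101, 500),
})

# Bucket scores into tiers of 10 (60-69 -> 6, 70-79 -> 7, ...).
df["tier"] = df["score"] // 10

X = df[["critics", "years_since"]]
y = df["tier"]

# Standard 80% train / 20% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```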
We used Pandas and SQLAlchemy to load the CSV files that we started with and merged the tables we created into a SQL database. Our database can then interact with our machine learning model, which allows us to make predictions and analyze our results.
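A minimal sketch of that Pandas-and-SQLAlchemy flow. An in-memory SQLite engine stands in here for the project's actual Postgres connection string, and the table contents are illustrative:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for a Postgres URL such as
# "postgresql://user:pass@localhost:5432/games_db".
engine = create_engine("sqlite://")

games = pd.DataFrame({"name": ["Portal 2"], "score": [95]})
sales = pd.DataFrame({"name": ["Portal 2"], "global_sales": [4.2]})

# Write each cleaned table into the database...
games.to_sql("games", engine, index=False, if_exists="replace")
sales.to_sql("sales", engine, index=False, if_exists="replace")

# ...then read the joined table back out for the machine learning model.
query = """
    SELECT g.name, g.score, s.global_sales
    FROM games AS g
    LEFT JOIN sales AS s ON g.name = s.name
"""
merged = pd.read_sql(query, engine)
```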
We will use Tableau and Pandas to build visualizations of our analysis, which we will showcase on Google Slides.
For the interactive portion, we will build options to manipulate the features shown in the graphs.
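As one example of the kind of aggregate behind those visuals, a score-by-genre summary can be computed in pandas before handing the data to Tableau (the rows below are sample data, not the real dataset):

```python
import pandas as pd

games = pd.DataFrame({
    "main_genre": ["Shooter", "Shooter", "Racing"],
    "score": [94, 88, 84],
})

# Aggregate that feeds a score-by-genre chart.
by_genre = games.groupby("main_genre")["score"].mean().reset_index()
```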
Link to Live Dashboard with Interactive Element for User Discovery: here