<div style="border-radius: 10px; border: #6B8E23 solid; padding: 15px; background-color: #F5F5DC; font-size: 100%; text-align: left">

<h3 align="left"><font color='#556B2F'>📜 Introduction : </font></h3>
    
**Overview**
    
The mobile games industry is worth billions of dollars, with companies spending vast amounts of money on the development and marketing of these games to an equally large market. Using this data set, insights can be gained into a sub-market of this market, strategy games. This sub-market includes titles such as Clash of Clans, Plants vs Zombies and Pokemon GO.

**Background**
    
This is the data of 17007 strategy games on the Apple App Store. It was collected on the 3rd of August 2019, using the iTunes API and the App Store sitemap.

**Some ideas**
    
You could use the number of ratings as a proxy indicator for the overall success of a game, and then work out what factors make a successful game. Or you could measure the state of the market over time and try to predict where it is headed.
And I think an analysis of the icons of the apps would be pretty cool.

<a id="1"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">Apple Application Store Strategy Games</h1>

<a id = "2"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">✨Importing Libraries & Reading Data✨</p>

In [None]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
from sklearn.metrics.pairwise import cosine_similarity

import matplotlib.pyplot as plt
from imageio import imread

import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("/kaggle/input/17k-apple-app-store-strategy-games/appstore_games.csv", 
                 usecols=["ID",
                          "Name",
                          "Description",
                          "Primary Genre",
                          "Genres",
                          "Average User Rating"])

In [None]:
df.head()

<div style="border-radius: 10px; border: #6B8E23 solid; padding: 15px; background-color: #F5F5DC; font-size: 100%; text-align: left">

<h3 align="left"><font color='#556B2F'>👀 Features : </font></h3>
    
* **ID:** Game ID
* **Name:** Game Name
* **Average User Rating:** Average player rating
* **Description:** Description of the game content
* **Primary Genre:** Primary genre
* **Genres:** The categories the game falls into

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df.dropna(inplace=True)

In [None]:
df = df[df["Primary Genre"]=="Games"]
df

In [None]:
df.duplicated().sum() # has 71 duplicate rows

In [None]:
df = df.drop_duplicates() # drop the duplicate rows

In [None]:
# Inspect games whose description is shorter than 30 characters.
df[df["Description"].str.len() < 30].head(50)

In [None]:
df = df.loc[~(df["Description"].str.len() < 30)]
df

<a id = "3"></a><br>
<p style="font-family: 'Pacifico', cursive; font-weight: bold; letter-spacing: 2px; color: #556B2F; font-size: 160%; text-align: left; padding: 0px; border-bottom: 3px solid">🗒️ NLP (Natural Language Processing)🗒️</p>

In [None]:
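def decode(column):
    # Interpret escape sequences such as \u2122, keep only ASCII letters,
    # digits (0-9) and spaces, then trim leading and trailing spaces.
    column = column.str.decode("unicode_escape")\
    .str.replace(r'[^a-zA-Z0-9\ ]', '', regex=True).str.strip()
    return column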

In [None]:
df.loc[:,"Name"] = decode(df["Name"])

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* The `decode` function cleans a text column: it interprets Unicode escape sequences, keeps only English letters, digits, and spaces, and strips leading and trailing spaces.
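
The cleaning rule described above can be tried on a single string without pandas. This is only a minimal sketch: `clean_text` is a hypothetical helper that is not part of the notebook, it uses the standard-library `codecs` and `re` modules instead of the pandas string accessor, and it assumes the digit range `0-9`.

```python
import codecs
import re


def clean_text(text):
    """Decode escape sequences, keep ASCII letters, digits and spaces, then trim."""
    # Turn literal escape sequences (for example "\\u2122") into real characters.
    decoded = codecs.decode(text, "unicode_escape")
    # Remove every character that is not an ASCII letter, a digit or a space.
    kept = re.sub(r"[^a-zA-Z0-9 ]", "", decoded)
    # Trim the spaces left at both ends.
    return kept.strip()


# The trademark sign, the colon and the exclamation mark are removed.
print(clean_text("  Clash of Clans\\u2122: 2020 Edition!  "))
```

Spaces inside the name are kept, so multi-word titles stay readable after cleaning.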

In [None]:
df.reset_index(inplace=True)

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
* Reset the index before the TF-IDF step so that each row position in the DataFrame matches the corresponding row of the TF-IDF matrix.

In [None]:
nlp = spacy.load('en_core_web_sm')

def lemmatize(text):
    doc = nlp(text)
    tokens = [token for token in doc if not token.is_punct]
    lemmas = [token.lemma_ if token.pos_ != 'PRON' else token.orth_ for token in tokens]
    return lemmas

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
This code passes the incoming text through spaCy, drops punctuation tokens, and returns a list with the base form (lemma) of each remaining word. Pronouns are kept in their original form.

In [None]:
tfidf = TfidfVectorizer(stop_words = "english", tokenizer = lemmatize)
tfidf_matrix = tfidf.fit_transform(df["Description"])
tfidf.get_feature_names_out()

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
This code converts the descriptions into TF-IDF (Term Frequency-Inverse Document Frequency) vectors. The result is a matrix with one row per game and one column per term, where each entry is the TF-IDF weight of that term in that description. `get_feature_names_out()` returns the terms that correspond to the columns.
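
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It is an illustration, not scikit-learn's implementation: `tfidf_vectors` and `toy_docs` are hypothetical names, tokens are split on whitespace with no stop-word removal or lemmatization, and the formula (smoothed IDF followed by L2 normalization) is meant to mirror the default settings of `TfidfVectorizer` as I understand them.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with smoothed-IDF, L2-normalized weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency: rare terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit length (guard against empty documents).
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


toy_docs = ["build your army", "build your farm", "match colorful candies"]
vocab, rows = tfidf_vectors(toy_docs)
print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first toy document, "army" occurs in only one description, so it receives a larger weight than "build" or "your", which occur in two.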

In [None]:
cosine_sim = cosine_similarity(tfidf_matrix,
                               tfidf_matrix)

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
Cosine similarity measures how similar two vectors (TF-IDF vectors in this case) are by taking the cosine of the angle between them. The smaller the angle, the higher the score. Because TF-IDF weights are non-negative, the score ranges from 0 (no terms in common) to 1 (vectors pointing in the same direction).
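
As a rough illustration of the formula, the sketch below computes the cosine similarity of two plain Python lists. The `cosine` helper is hypothetical and is not what the notebook runs; the notebook relies on scikit-learn's `cosine_similarity`. Returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # An all-zero vector has no direction; treat its similarity as 0.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine([1, 2, 0], [2, 4, 0])  # close to 1: same direction
no_overlap = cosine([1, 0, 0], [0, 3, 0])      # 0: no shared non-zero terms
print(same_direction, no_overlap)
```

The length of the vectors does not matter, only their direction, which is why a long and a short description can still score as highly similar.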

In [None]:
# This code returns the index of the first row that contains the word "pubg" in the "Name" column.

index = df[df["Name"].str.contains(r"pubg", regex=True,case=False)].drop_duplicates().index[0]
index

In [None]:
similarity_scores = pd.DataFrame(cosine_sim[index],
                                 columns=["score"])

# Drop the query game itself, then keep the 9 most similar games.
game_indices = similarity_scores.drop(index).sort_values("score", ascending=False).head(9).index

df['Name'].iloc[game_indices]

In [None]:
# Apply the same cleaning through the decode helper instead of repeating it inline.
decode(df["Name"].iloc[game_indices])

In [None]:
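def icons(recom):
    # Re-read the CSV to look up the icon URL of each recommended game.
    icon_df = pd.read_csv("/kaggle/input/17k-apple-app-store-strategy-games/appstore_games.csv",
                          usecols=["Icon URL","Name"])
    # Clean the names the same way as in df so that they can be matched.
    icon_df.loc[:,"Name"] = decode(icon_df["Name"])
    icon_urls = icon_df.loc[icon_df["Name"].isin(recom), ["Name","Icon URL"]]
    return icon_urls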

In [None]:
names_links = icons(df['Name'].iloc[game_indices])

In [None]:
plt.figure(figsize=(10,9))
for i, title_img in enumerate(names_links.values):
    plt.subplot(4,3,i+1)
    img = imread(title_img[1])  # download the icon from its URL
    plt.imshow(img)
    plt.title(title_img[0])
    plt.axis("off")
plt.tight_layout()  # adjust spacing once, after all subplots are drawn

<a id="4"></a>
<h1 style="border-radius: 10px; border: 2px solid #6B8E23; background-color: #F5F5DC; font-family: 'Pacifico', cursive; font-size: 200%; text-align: center; border-radius: 15px 50px; padding: 15px; box-shadow: 5px 5px 5px #556B2F; color: #556B2F;">Popular Video Games Dataset</h1>

In [None]:
vg = pd.read_csv("/kaggle/input/popular-video-games-1980-2023/games.csv", 
                 usecols=["Title","Rating","Genres","Summary"])
vg.head()

In [None]:
# `in` on a Series tests the index labels, so compare against the values instead.
(vg["Title"] == "Left For Dead").any()

In [None]:
print(vg.isnull().sum())  # missing values per column
vg.dropna(inplace=True,axis=0)
vg

In [None]:
print(vg.duplicated().sum())  # number of duplicate rows
vg.drop_duplicates(inplace=True)
vg.head()

<div style="border-radius:10px; border:#65647C solid; padding: 15px; background-color: #F8EDE3; font-size:100%; text-align:left">

<h3 align="left"><font color='#7D6E83'><b>🗨️ Comment: </b></font></h3>
    
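* We follow the same steps as for the App Store data: clean the text, reset the index, build the TF-IDF matrix, and compute the cosine similarity.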

In [None]:
vg.loc[:,"Summary"] = decode(vg["Summary"])

In [None]:
vg.reset_index(inplace=True, drop=True)

In [None]:
tfidf2 = TfidfVectorizer(stop_words="english", tokenizer=lemmatize)
tfidf_matrix2 = tfidf2.fit_transform(vg['Summary'])
tfidf2.get_feature_names_out()

In [None]:
cosine_sim2 = cosine_similarity(tfidf_matrix2,
                               tfidf_matrix2)
cosine_sim2

In [None]:
index2 = vg[vg["Title"].str.contains(r"Valorant", regex=True,case=False)].drop_duplicates().index[0]

In [None]:
similarity_scores2 = pd.DataFrame(cosine_sim2[index2],
                                 columns=["score"])

# Drop the query game itself, then keep the 5 most similar games.
games2 = similarity_scores2.drop(index2).sort_values("score", ascending=False).head(5).index

vg['Title'].iloc[games2]

<center><img src="https://i.imgur.com/NroW500.png" width="600" height="600"></center>