<center><h1 style="font-size: 64px;">Pokémon Analysis Final Project</h1></center>


<center><h1 style="font-size: 36px;">Author: Saransh Rakshak | Course: DSA-8640 | Due: Dec. 11, 2024</h1></center>

![Cover Image](poke_cover_image.jpg)

In [None]:
# import libraries
import os
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime

# save data locations/paths
data_path = "pokemon_data/"
data_folders = os.listdir(data_path)
data_quick = data_folders[:2]

<div style="background-image: url('./pika_scaled_title_back.png'); 
            background-size: contain;
            background-position: center;
            background-repeat: no-repeat; 
            color: black;
            margin: 0 auto; 
            padding: 50px; 
            text-align: center; 
            border-radius: 10px;"><center><h1 style="font-size: 54px;"><b><i>Part 1: Web Scraping</i></b></h1></center></div>

- The first step is to extract various values from the raw HTML files. You can use BeautifulSoup or other Python modules.

### ‎

> <center><h1>🔹 🔹 🅐 🔹 🔹</h1></center>
>
> From all the iOS pages (ending with **“_ios.html”**), extract (i) number of customer ratings in the Current Version (let’s call it *ios_current_ratings*); and (ii) number of customer ratings in All Versions (*ios_all_ratings*). For example, the extracted values should be:
>
> <center><i>4688, 106508</i></center> for the file: <center><b>“2016-07-21/00_00_pokemon_ios.html”</b></center> 
>
> There are 2 values from iOS pages.

In [None]:
# helper function to parse iOS html file
def parse_ios_html(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")
        
        # ios_current_ratings
        curr_ratings_div = soup.find("div", string="Current Version:")
        curr_ratings_span = curr_ratings_div.find_next("span", class_="rating-count") if curr_ratings_div else None
        ios_current_ratings = int(curr_ratings_span.text.split()[0].replace(",", "")) if curr_ratings_span else None
        
        # ios_all_ratings
        all_ratings_div = soup.find("div", string="All Versions:")
        all_ratings_span = all_ratings_div.find_next("span", class_="rating-count") if all_ratings_div else None
        ios_all_ratings = int(all_ratings_span.text.split()[0].replace(",", "")) if all_ratings_span else None
        
        return {"ios_current_ratings": ios_current_ratings, "ios_all_ratings": ios_all_ratings}
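
The count extraction above relies on the span text starting with the number, for example `4,688 Ratings` (the exact wording is an assumption inferred from the `split()[0]` call). A minimal sketch of that string handling in isolation, using only the standard library; `parse_count` is a hypothetical helper name, not part of the notebook:

```python
import re

def parse_count(text):
    """Return the first integer in a string such as '4,688 Ratings', or None."""
    match = re.search(r"\d[\d,]*", text or "")
    if match is None:
        return None
    # Drop thousands separators before converting.
    return int(match.group().replace(",", ""))

print(parse_count("4688 Ratings"))
print(parse_count("106,508 Ratings"))
print(parse_count(""))
```

Returning `None` for text without digits mirrors how `parse_ios_html` reports a missing rating.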

In [None]:
ios_dict = {}

# pull ios html data using parse_ios_html()
for folder in data_folders:
    for filename in os.listdir(data_path + folder):
        if filename.endswith("_ios.html"):
            file_path = os.path.join(data_path, folder, filename)
            # key each row by "<folder>/<filename>"; parse each file only once
            ios_index = folder + "/" + filename
            ios_dict[ios_index] = parse_ios_html(file_path)

# construct data frame
ios_data = pd.DataFrame.from_dict(ios_dict, orient = "index")

In [4]:
ios_data

| | ios_current_ratings | ios_all_ratings |
|---|---|---|
| 2016-07-21/00_00_pokemon_ios.html | 4688.0 | 106508 |
| 2016-07-21/00_10_pokemon_ios.html | 4688.0 | 106508 |
| 2016-07-21/00_20_pokemon_ios.html | 4688.0 | 106508 |
| 2016-07-21/00_30_pokemon_ios.html | 4688.0 | 106508 |
| 2016-07-21/00_40_pokemon_ios.html | 4688.0 | 106508 |
| ... | ... | ... |
| 2016-07-31/23_10_pokemon_ios.html | 17856.0 | 139213 |
| 2016-07-31/23_20_pokemon_ios.html | 22193.0 | 143350 |
| 2016-07-31/23_30_pokemon_ios.html | 22193.0 | 143350 |
| 2016-07-31/23_40_pokemon_ios.html | 22193.0 | 143350 |


In [5]:
ios_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1584 entries, 2016-07-21/00_00_pokemon_ios.html to 2016-07-31/23_50_pokemon_ios.html
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ios_current_ratings  1552 non-null   float64
 1   ios_all_ratings      1584 non-null   int64  
dtypes: float64(1), int64(1)
memory usage: 37.1+ KB
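
The `info()` output shows `ios_current_ratings` stored as `float64` with 1552 of 1584 values present: pandas falls back to float when an integer column contains `None`. One option, sketched here on made-up values rather than the scraped data, is the nullable `Int64` dtype, which keeps integers and marks gaps as `<NA>`. The frame `toy` is purely illustrative.

```python
import pandas as pd

# Toy stand-in for ios_data; the values are illustrative, not scraped.
toy = pd.DataFrame(
    {
        "ios_current_ratings": [4688, None, 22193],
        "ios_all_ratings": [106508, 106508, 143350],
    }
)
print(toy["ios_current_ratings"].dtype)  # float, because of the missing value

# Count missing values per column, then switch to the nullable integer dtype.
missing_per_column = toy.isna().sum()
toy["ios_current_ratings"] = toy["ios_current_ratings"].astype("Int64")
print(missing_per_column["ios_current_ratings"], toy["ios_current_ratings"].dtype)
```

Whether to convert depends on the later analysis; plotting and arithmetic work with either dtype.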


In [None]:
# Compare with expected answers
print(all(ios_data.loc['2016-07-21/00_00_pokemon_ios.html'].values == [4688, 106508]))

True


In [7]:
ios_data.loc['2016-07-21/00_00_pokemon_ios.html']

ios_current_ratings      4688.0
ios_all_ratings        106508.0
Name: 2016-07-21/00_00_pokemon_ios.html, dtype: float64

## ‎

> <center><h1>🔹 🔹 🅑 🔹 🔹</h1></center>
>
> From all the Android pages (ending with **“_android.html”**), extract (i) average rating (in the scale between 1.0 and 5.0) (*android_avg_rating*); (ii) number of total ratings (*android_total_ratings*); and (iii) number of ratings for 1-5 stars (*android_ratings_1*, *android_ratings_2*, …, *android_ratings_5*). For example, the extracted values should be: 
>
> <center><i>3.9, 1281802, 199974, 71521, 117754, 165956, 726597</i></center>
>
> for the file:
>
> <center><b>“2016-07-21/00_00_pokemon_android.html”</b></center>
> 
> There are 7 values from Android pages.

In [None]:
# helper function to parse android html file
def parse_android_html(file_path):
    android_star_ratings = {'one' : [], 'two' : [], 'three' : [], 'four' : [], 'five' : []}
    with open(file_path, "r", encoding="utf-8") as file:
        soup = BeautifulSoup(file, "html.parser")
        
        # android_avg_rating
        avg_rating_div = soup.find("div", class_="score")
        android_avg_rating = float(avg_rating_div.text.strip())
        
        # android_total_ratings
        total_ratings_div = soup.find("span", class_="reviews-num")
        android_total_rating = int(total_ratings_div.text.replace(",", "").strip())
        
        # android_ratings_[1 through 5]
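        for star_val in android_star_ratings:
            # guard against a missing container before searching for its count span
            star_container = soup.find("div", class_=f"rating-bar-container {star_val}")
            star_rating_span = star_container.find_next("span", class_="bar-number") if star_container else None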
            if star_rating_span:
                android_star_ratings[star_val].append(int(star_rating_span.text.replace(",", "").strip()))
            else:
                android_star_ratings[star_val].append(0)
        
        return {"android_avg_rating" : android_avg_rating,
                "android_total_ratings" : android_total_rating,
                "android_ratings_1" : android_star_ratings["one"][0],
                "android_ratings_2" : android_star_ratings["two"][0],
                "android_ratings_3" : android_star_ratings["three"][0],
                "android_ratings_4" : android_star_ratings["four"][0],
                "android_ratings_5" : android_star_ratings["five"][0]}

In [None]:
android_dict = {}

# pull parsed android html data
for folder in data_folders:
    for filename in os.listdir(data_path + folder):
        if filename.endswith("_android.html"):
            file_path = os.path.join(data_path, folder, filename)
            # key each row by "<folder>/<filename>"; parse each file only once
            android_index = folder + "/" + filename
            android_dict[android_index] = parse_android_html(file_path)

# constructing data frame with values from parsed android html
android_data = pd.DataFrame.from_dict(android_dict, orient = "index")

In [10]:
android_data

| | android_avg_rating | android_total_ratings | android_ratings_1 | android_ratings_2 | android_ratings_3 | android_ratings_4 | android_ratings_5 |
|---|---|---|---|---|---|---|---|
| 2016-07-21/00_00_pokemon_android.html | 3.9 | 1281802 | 199974 | 71521 | 117754 | 165956 | 726597 |
| 2016-07-21/00_10_pokemon_android.html | 3.9 | 1281802 | 199974 | 71521 | 117754 | 165956 | 726597 |
| 2016-07-21/00_20_pokemon_android.html | 3.9 | 1281802 | 199974 | 71521 | 117754 | 165956 | 726597 |
| 2016-07-21/00_30_pokemon_android.html | 3.9 | 1281802 | 199974 | 71521 | 117754 | 165956 | 726597 |
| 2016-07-21/00_40_pokemon_android.html | 3.9 | 1281802 | 199974 | 71521 | 117754 | 165956 | 726597 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-07-31/23_10_pokemon_android.html | 3.9 | 1954991 | 302864 | 101244 | 173651 | 259919 | 1117313 |
| 2016-07-31/23_20_pokemon_android.html | 3.9 | 1954991 | 302864 | 101244 | 173651 | 259919 | 1117313 |
| 2016-07-31/23_30_pokemon_android.html | 3.9 | 1954991 | 302864 | 101244 | 173651 | 259919 | 1117313 |
| 2016-07-31/23_40_pokemon_android.html | 3.9 | 1954991 | 302864 | 101244 | 173651 | 259919 | 1117313 |


In [11]:
android_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1584 entries, 2016-07-21/00_00_pokemon_android.html to 2016-07-31/23_50_pokemon_android.html
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   android_avg_rating     1584 non-null   float64
 1   android_total_ratings  1584 non-null   int64  
 2   android_ratings_1      1584 non-null   int64  
 3   android_ratings_2      1584 non-null   int64  
 4   android_ratings_3      1584 non-null   int64  
 5   android_ratings_4      1584 non-null   int64  
 6   android_ratings_5      1584 non-null   int64  
dtypes: float64(1), int64(6)
memory usage: 99.0+ KB


In [12]:
android_data.loc['2016-07-21/00_00_pokemon_android.html']

android_avg_rating             3.9
android_total_ratings    1281802.0
android_ratings_1         199974.0
android_ratings_2          71521.0
android_ratings_3         117754.0
android_ratings_4         165956.0
android_ratings_5         726597.0
Name: 2016-07-21/00_00_pokemon_android.html, dtype: float64
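
The row above offers a quick sanity check on the parser: the five star counts should add up to `android_total_ratings`, and their weighted mean should round to `android_avg_rating`. The sketch below hard-codes the values from that row; `star_counts` and the other variable names are illustrative.

```python
# Values copied from the 2016-07-21 00:00 Android row.
star_counts = {1: 199974, 2: 71521, 3: 117754, 4: 165956, 5: 726597}
total_ratings = 1281802
avg_rating = 3.9

# The per-star counts should sum to the reported total.
counts_sum = sum(star_counts.values())

# The weighted mean of the star values should round to the displayed average.
weighted_mean = sum(star * n for star, n in star_counts.items()) / counts_sum

print(counts_sum == total_ratings)
print(round(weighted_mean, 1) == avg_rating)
```

The same two checks could be applied column-wise to `android_data` to flag any snapshot where the page was only partially parsed.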

In [None]:
# Compare with expected values
print(all(android_data.loc['2016-07-21/00_00_pokemon_android.html'].values == [3.9, 1281802, 199974, 71521, 117754, 165956, 726597]))

True
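
Exact `==` passes here, as the output shows, but a tolerance-based comparison is a safer habit once floats such as the average rating are involved. A minimal sketch using the standard library, with `row_matches` as a hypothetical helper name:

```python
import math

def row_matches(actual, expected, rel_tol=1e-9):
    """Compare two equal-length numeric sequences element by element."""
    if len(actual) != len(expected):
        return False
    return all(math.isclose(a, e, rel_tol=rel_tol) for a, e in zip(actual, expected))

expected = [3.9, 1281802, 199974, 71521, 117754, 165956, 726597]
# Simulate the float upcast seen when a single row is selected with .loc.
actual = [float(v) for v in expected]
print(row_matches(actual, expected))
```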


## ‎

---

## ‎

<div style="background-image: url('./char_background_2.png'); 
            background-size: 85%;
            background-position: center;
            background-repeat: no-repeat; 
            color: black;
            margin: 0 auto; 
            padding: 40px; 
            text-align: center; 
            border-radius: 10px;"><h1 style="font-size: 54px;"><b><i>Part 2: Data Organization</i></b></h1></div>

- The next step is to organize the extracted values, so that we can do some data exploration. As we have time series data, we will organize the data by *datetime* (note that *datetime* is a Python data type).

## ‎

> <center><h1>🔹 🔹 🅐 🔹 🔹</h1></center>
>
> Using the extracted values from the previous step, create a Python dictionary, where the key is a *datetime* object and the value is a dictionary with extracted values from iOS and Android HTML files. For example, for files: 
>
> <center><b>“2016-07-21/00_00_pokemon_android.html”</b> and <b>“2016-07-21/00_00_pokemon_ios.html”</b></center>
>
> the key should be:
>
> <center><i>datetime(2016, 7, 21, 0, 0, 0)</i></center>
>
> and the value should be: 
> 
> <center><i>{ ‘ios_current_ratings’ : 4688, ‘ios_all_ratings’ : 106508, ‘android_avg_rating’ : 3.9, ‘android_total_ratings’ : 1281802, ‘android_ratings_1’ : 199974, ‘android_ratings_2’ : 71521, ‘android_ratings_3’ : 117754, ‘android_ratings_4’ : 165956, ‘android_ratings_5’ : 726597}</i></center>

In [None]:
# directly constructing dict using previous helper functions parse_ios_html/parse_android_html
datetime_dict = {}

for folder in data_folders:
    for file_name in os.listdir(data_path + folder):
        if file_name.endswith("_ios.html") or file_name.endswith("_android.html"):
            
            # assigning specific datetime index from file_name
            time_str = file_name.split("_")[0:2]
            time_str = "-".join(time_str)
            date_str = os.path.basename(folder)
            date_time_str = f"{date_str}-{time_str}"
            try:
                dt_obj = datetime.strptime(date_time_str, "%Y-%m-%d-%H-%M")
            except ValueError:
                print(f"ValueError: Error parsing datetime from filename {file_name}")
                continue
            
            # pick the parser matching the platform and merge its values into this timestamp's entry
            file_path = os.path.join(data_path, folder, file_name)
            parser = parse_ios_html if file_name.endswith("_ios.html") else parse_android_html
            datetime_dict.setdefault(dt_obj, {}).update(parser(file_path))
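
The datetime derivation in the loop above can be isolated into a small function, which makes it easy to test without touching the data folder. This is a sketch under the assumption that every path has the form `<YYYY-MM-DD>/<HH>_<MM>_pokemon_<platform>.html`; `path_to_datetime` is a hypothetical name.

```python
from datetime import datetime

def path_to_datetime(rel_path):
    """Turn '2016-07-21/00_00_pokemon_ios.html' into datetime(2016, 7, 21, 0, 0)."""
    folder, file_name = rel_path.split("/")
    hour, minute = file_name.split("_")[:2]
    return datetime.strptime(f"{folder} {hour}:{minute}", "%Y-%m-%d %H:%M")

print(path_to_datetime("2016-07-21/00_00_pokemon_ios.html"))
print(path_to_datetime("2016-07-21/00_00_pokemon_android.html") == datetime(2016, 7, 21, 0, 0, 0))
```

Because the iOS and Android files for one snapshot map to the same key, their values end up merged in a single dictionary entry.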

In [None]:
# Compare with expected values
key = datetime(2016, 7, 21, 0, 0, 0)
checks = [
    datetime_dict[key]["ios_current_ratings"] == 4688,
    datetime_dict[key]["ios_all_ratings"] == 106508,
    datetime_dict[key]["android_avg_rating"] == 3.9,
    datetime_dict[key]["android_total_ratings"] == 1281802,
    datetime_dict[key]["android_ratings_1"] == 199974,
    datetime_dict[key]["android_ratings_2"] == 71521,
    datetime_dict[key]["android_ratings_3"] == 117754,
    datetime_dict[key]["android_ratings_4"] == 165956,
    datetime_dict[key]["android_ratings_5"] == 726597,
    all(isinstance(k, datetime) for k in datetime_dict)  # every key is a datetime
]
print(all(checks))

True
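
The 1584 entries reported earlier match 11 days of snapshots taken every 10 minutes (11 × 144). Assuming that sampling interval, which the file names suggest but the assignment text does not state, missing snapshots can be detected by comparing the dictionary keys against the expected grid. `missing_timestamps` is a hypothetical helper, shown here on a toy dictionary:

```python
from datetime import datetime, timedelta

def missing_timestamps(keys, start, end, step=timedelta(minutes=10)):
    """Return expected timestamps from start to end (inclusive) that are absent from keys."""
    present = set(keys)
    missing = []
    current = start
    while current <= end:
        if current not in present:
            missing.append(current)
        current += step
    return missing

# Toy dictionary with the 00:20 snapshot deliberately left out.
toy = {
    datetime(2016, 7, 21, 0, 0): {},
    datetime(2016, 7, 21, 0, 10): {},
    datetime(2016, 7, 21, 0, 30): {},
}
print(missing_timestamps(toy, datetime(2016, 7, 21, 0, 0), datetime(2016, 7, 21, 0, 30)))
```

Calling it with `datetime_dict`, `datetime(2016, 7, 21, 0, 0)` and `datetime(2016, 7, 31, 23, 50)` would list any gaps in the scraped series.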


## ‎

> <center><h1>🔹 🔹 🅑 🔹 🔹</h1></center>
>
> Convert the dictionary into a Pandas *dataframe*, *pokemon_db*, where the index is *datetime* and columns are names of the extracted 9 iOS/Android values.

In [20]:
pokemon_db = pd.DataFrame.from_dict(datetime_dict, orient = "index")
pokemon_db

| | android_avg_rating | android_total_ratings | android_ratings_1 | android_ratings_2 | android_ratings_3 | android_ratings_4 | android_ratings_5 | ios_current_ratings | ios_all_ratings |
|---|---|---|---|---|---|---|---|---|---|
| 2016-07-21 00:00:00 | 3.9 | 1281802 | 199974 | 71521 | 117754 | 165956 | 726597 | 4688.0 | 106508 |
| 2016-07-21 00:10:00 | 3.9 | 1281802 | 199974 | 71521 | 117754 | 165956 | 726597 | 4688.0 | 106508 |
| 2016-07-21 00:20:00 | 3.9 | 1281802 | 199974 | 71521 | 117754 | 165956 | 726597 | 4688.0 | 106508 |
| 2016-07-21 00:30:00 | 3.9 | 1281802 | 199974 | 71521 | 117754 | 165956 | 726597 | 4688.0 | 106508 |
| 2016-07-21 00:40:00 | 3.9 | 1281802 | 199974 | 71521 | 117754 | 165956 | 726597 | 4688.0 | 106508 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2016-07-31 23:10:00 | 3.9 | 1954991 | 302864 | 101244 | 173651 | 259919 | 1117313 | 17856.0 | 139213 |
| 2016-07-31 23:20:00 | 3.9 | 1954991 | 302864 | 101244 | 173651 | 259919 | 1117313 | 22193.0 | 143350 |
| 2016-07-31 23:30:00 | 3.9 | 1954991 | 302864 | 101244 | 173651 | 259919 | 1117313 | 22193.0 | 143350 |
| 2016-07-31 23:40:00 | 3.9 | 1954991 | 302864 | 101244 | 173651 | 259919 | 1117313 | 22193.0 | 143350 |


In [23]:
# Compare with expected answer
checks = [
    isinstance(pokemon_db, pd.DataFrame), # checking if table is pandas dataframe
    isinstance(pokemon_db.index, pd.DatetimeIndex), # check if table index values are DatetimeIndex objects
    len(pokemon_db.columns) == 9, # should have exactly 9 cols
    list(pokemon_db.columns) == list(android_data.columns) + list(ios_data.columns)# ensuring got all cols from ios/android
    ]
print(all(checks))

True


## ‎

> <center><h1>🔹 🔹 🅒 🔹 🔹</h1></center>
>
> Save the dataframe into two formats (CSV and Excel). The file names should be 
>
> <center><b>"pokemon.csv"</b> and <b>"pokemon.xlsx"</b>.</center>

In [22]:
pokemon_db.to_csv("pokemon.csv")
pokemon_db.to_excel("pokemon.xlsx")

In [25]:
# confirm files were created
checks = [
    os.path.exists("pokemon.csv"),
    os.path.exists("pokemon.xlsx")
]
print(all(checks))

True
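
A CSV file stores the index as plain text, so the *datetime* index has to be re-parsed when the file is loaded again. A minimal sketch of that round trip, using an in-memory buffer and a toy frame in place of `pokemon_db` (the values are illustrative); `index_col=0` and `parse_dates=True` are standard `pd.read_csv` parameters:

```python
import io
import pandas as pd

# Toy frame standing in for pokemon_db.
index = pd.to_datetime(["2016-07-21 00:00:00", "2016-07-21 00:10:00"])
toy_db = pd.DataFrame({"ios_all_ratings": [106508, 106508]}, index=index)

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy_db.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(isinstance(restored.index, pd.DatetimeIndex))
print(restored["ios_all_ratings"].tolist() == toy_db["ios_all_ratings"].tolist())
```

The same arguments should apply when reading `pokemon.csv` from disk.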


### ‎

---
---
---