# EDA: Global YouTube Statistics 2023

## 1. Understand the dataset

### Potential use cases

In [None]:
"""

YouTube Analytics: Gain valuable insights into the success factors of top YouTube channels and understand what sets them 
apart from the rest.

Content Strategy: Discover the most popular categories and upload frequencies that resonate with audiences.

Regional Influencers: Identify influential YouTube creators from different countries and analyze their impact on a global scale.

Earnings Analysis: Explore the correlation between channel performance and estimated earnings.

Geospatial Visualization: Visualize the distribution of successful YouTube channels on a world map and uncover geographical 
trends.

Trending Topics: Investigate how certain categories gain popularity over time and correlate with world events.

"""

### Data attributes

In [None]:
"""
rank: Position of the YouTube channel based on the number of subscribers
Youtuber: Name of the YouTube channel
subscribers: Number of subscribers to the channel
video views: Total views across all videos on the channel
category: Category or niche of the channel
Title: Title of the YouTube channel
uploads: Total number of videos uploaded on the channel
Country: Country where the YouTube channel originates
Abbreviation: Abbreviation of the country
channel_type: Type of the YouTube channel (e.g., Music, Games, Entertainment)
video_views_rank: Ranking of the channel based on total video views
country_rank: Ranking of the channel based on the number of subscribers within its country
channel_type_rank: Ranking of the channel within its channel type
video_views_for_the_last_30_days: Total video views in the last 30 days
lowest_monthly_earnings: Lowest estimated monthly earnings from the channel
highest_monthly_earnings: Highest estimated monthly earnings from the channel
lowest_yearly_earnings: Lowest estimated yearly earnings from the channel
highest_yearly_earnings: Highest estimated yearly earnings from the channel
subscribers_for_last_30_days: Number of new subscribers gained in the last 30 days
created_year: Year when the YouTube channel was created
created_month: Month when the YouTube channel was created
created_date: Day of the month (1-31) when the YouTube channel was created
Gross tertiary education enrollment (%): Percentage of the population enrolled in tertiary education in the country
Population: Total population of the country
Unemployment rate: Unemployment rate in the country
Urban_population: Number of people living in urban areas in the country
Latitude: Latitude coordinate of the country's location
Longitude: Longitude coordinate of the country's location

"""

## 2. Initial data exploration

In [132]:
# loads packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [133]:
# imports dataset & converts to a pandas dataframe (df)

df = pd.read_csv(r"C:\Users\shass\my_github\my_practice_repository\datasets\global_youtube_statistics_2023.csv")

# calls the df to review results

df

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,2.280000e+11,Music,T-Series,20082,India,IN,Music,1.0,1.0,1.0,2.258000e+09,564600.0,9000000.00,6800000.00,1.084000e+08,2000000.0,2006.0,Mar,13.0,28.1,1.366418e+09,5.36,471031528.0,20.593684,78.962880
1,2,YouTube Movies,170000000,0.000000e+00,Film & Animation,youtubemovies,1,United States,US,Games,4055159.0,7670.0,7423.0,1.200000e+01,0.0,0.05,0.04,5.800000e-01,,2006.0,Mar,5.0,88.2,3.282395e+08,14.70,270663028.0,37.090240,-95.712891
2,3,MrBeast,166000000,2.836884e+10,Entertainment,MrBeast,741,United States,US,Entertainment,48.0,1.0,1.0,1.348000e+09,337000.0,5400000.00,4000000.00,6.470000e+07,8000000.0,2012.0,Feb,20.0,88.2,3.282395e+08,14.70,270663028.0,37.090240,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,1.640000e+11,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,2.0,2.0,1.0,1.975000e+09,493800.0,7900000.00,5900000.00,9.480000e+07,1000000.0,2006.0,Sep,1.0,88.2,3.282395e+08,14.70,270663028.0,37.090240,-95.712891
4,5,SET India,159000000,1.480000e+11,Shows,SET India,116536,India,IN,Entertainment,3.0,2.0,2.0,1.824000e+09,455900.0,7300000.00,5500000.00,8.750000e+07,1000000.0,2006.0,Sep,20.0,28.1,1.366418e+09,5.36,471031528.0,20.593684,78.962880
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
990,991,Natan por Aï¿,12300000,9.029610e+09,Sports,Natan por Aï¿,1200,Brazil,BR,Entertainment,525.0,55.0,172.0,5.525130e+08,138100.0,2200000.00,1700000.00,2.650000e+07,700000.0,2017.0,Feb,12.0,51.3,2.125594e+08,12.08,183241641.0,-14.235004,-51.925280
991,992,Free Fire India Official,12300000,1.674410e+09,People & Blogs,Free Fire India Official,1500,India,IN,Games,6141.0,125.0,69.0,6.473500e+07,16200.0,258900.00,194200.00,3.100000e+06,300000.0,2018.0,Sep,14.0,28.1,1.366418e+09,5.36,471031528.0,20.593684,78.962880
992,993,Panda,12300000,2.214684e+09,,HybridPanda,2452,United Kingdom,GB,Games,129005.0,867.0,1202.0,6.703500e+04,17.0,268.00,201.00,3.200000e+03,1000.0,2006.0,Sep,11.0,60.0,6.683440e+07,3.85,55908316.0,55.378051,-3.435973
993,994,RobTopGames,12300000,3.741235e+08,Gaming,RobTopGames,39,Sweden,SE,Games,35112.0,4.0,69.0,3.871000e+06,968.0,15500.00,11600.00,1.858000e+05,100000.0,2012.0,May,9.0,67.0,1.028545e+07,6.48,9021165.0,60.128161,18.643501


In [134]:
# set the display.max_columns option to None to display all columns of the df

pd.set_option('display.max_columns', None)
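
`pd.set_option` changes the setting for the rest of the session. If the wider display is only wanted for a single output, `pd.option_context` is one alternative that restores the previous value afterwards. A minimal sketch (the variable names `before`, `inside` and `after` are only for illustration):

```python
import pandas as pd

# remembers the current setting so it can be compared afterwards
before = pd.get_option('display.max_columns')

# inside the block all columns are shown; the old value returns on exit
with pd.option_context('display.max_columns', None):
    inside = pd.get_option('display.max_columns')

after = pd.get_option('display.max_columns')

print(before, inside, after)
```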

In [135]:
# returns the first 10 rows of the df

df.head(10)

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,228000000000.0,Music,T-Series,20082,India,IN,Music,1.0,1.0,1.0,2258000000.0,564600.0,9000000.0,6800000.0,108400000.0,2000000.0,2006.0,Mar,13.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
1,2,YouTube Movies,170000000,0.0,Film & Animation,youtubemovies,1,United States,US,Games,4055159.0,7670.0,7423.0,12.0,0.0,0.05,0.04,0.58,,2006.0,Mar,5.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
2,3,MrBeast,166000000,28368840000.0,Entertainment,MrBeast,741,United States,US,Entertainment,48.0,1.0,1.0,1348000000.0,337000.0,5400000.0,4000000.0,64700000.0,8000000.0,2012.0,Feb,20.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000.0,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,2.0,2.0,1.0,1975000000.0,493800.0,7900000.0,5900000.0,94800000.0,1000000.0,2006.0,Sep,1.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
4,5,SET India,159000000,148000000000.0,Shows,SET India,116536,India,IN,Entertainment,3.0,2.0,2.0,1824000000.0,455900.0,7300000.0,5500000.0,87500000.0,1000000.0,2006.0,Sep,20.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
5,6,Music,119000000,0.0,,Music,0,,,Music,4057944.0,,,,0.0,0.0,0.0,0.0,,2013.0,Sep,24.0,,,,,,
6,7,ýýý Kids Diana Show,112000000,93247040000.0,People & Blogs,ýýý Kids Diana Show,1111,United States,US,Entertainment,5.0,3.0,3.0,731674000.0,182900.0,2900000.0,2200000.0,35100000.0,,2015.0,May,12.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
7,8,PewDiePie,111000000,29058040000.0,Gaming,PewDiePie,4716,Japan,JP,Entertainment,44.0,1.0,4.0,39184000.0,9800.0,156700.0,117600.0,1900000.0,,2010.0,Apr,29.0,63.2,126226600.0,2.29,115782416.0,36.204824,138.252924
8,9,Like Nastya,106000000,90479060000.0,People & Blogs,Like Nastya Vlog,493,Russia,RU,People,630.0,5.0,25.0,48947000.0,12200.0,195800.0,146800.0,2300000.0,100000.0,2016.0,Jan,14.0,81.9,144373500.0,4.59,107683889.0,61.52401,105.318756
9,10,Vlad and Niki,98900000,77180170000.0,Entertainment,Vlad and Niki,574,United States,US,Entertainment,8.0,5.0,6.0,580574000.0,145100.0,2300000.0,1700000.0,27900000.0,600000.0,2018.0,Apr,23.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891


In [136]:
# displays concise information of the df

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    float64
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         994 non-n

In [137]:
# generates descriptive statistics of the df

df.describe()

Unnamed: 0,rank,subscribers,video views,uploads,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
count,995.0,995.0,995.0,995.0,994.0,879.0,962.0,939.0,995.0,995.0,995.0,995.0,658.0,990.0,990.0,872.0,872.0,872.0,872.0,872.0,872.0
mean,498.0,22982410.0,11039540000.0,9187.125628,554248.9,386.05347,745.719335,175610300.0,36886.148281,589807.8,442257.4,7081814.0,349079.1,2012.630303,15.746465,63.627752,430387300.0,9.279278,224215000.0,26.632783,-14.128146
std,287.37606,17526110.0,14110840000.0,34151.352254,1362782.0,1232.244746,1944.386561,416378200.0,71858.724092,1148622.0,861216.1,13797040.0,614355.4,4.512503,8.77752,26.106893,472794700.0,4.888354,154687400.0,20.560533,84.760809
min,1.0,12300000.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1970.0,1.0,7.6,202506.0,0.75,35588.0,-38.416097,-172.104629
25%,249.5,14500000.0,4288145000.0,194.5,323.0,11.0,27.0,20137500.0,2700.0,43500.0,32650.0,521750.0,100000.0,2009.0,8.0,36.3,83355410.0,5.27,55908320.0,20.593684,-95.712891
50%,498.0,17700000.0,7760820000.0,729.0,915.5,51.0,65.5,64085000.0,13300.0,212700.0,159500.0,2600000.0,200000.0,2013.0,16.0,68.0,328239500.0,9.365,270663000.0,37.09024,-51.92528
75%,746.5,24600000.0,13554700000.0,2667.5,3584.5,123.0,139.75,168826500.0,37900.0,606800.0,455100.0,7300000.0,400000.0,2016.0,23.0,88.2,328239500.0,14.7,270663000.0,37.09024,78.96288
max,995.0,245000000.0,228000000000.0,301308.0,4057944.0,7741.0,7741.0,6589000000.0,850900.0,13600000.0,10200000.0,163400000.0,8000000.0,2022.0,31.0,113.1,1397715000.0,14.72,842934000.0,61.92411,138.252924


## 3. Data cleaning

### Check that we need all columns

In [None]:
# all columns in this df are useful, so none are dropped

### Check for duplicated rows

In [138]:
# uses the duplicated() method to mark duplicate rows

duplicates = df.duplicated()

# filters the DataFrame using the Boolean Series to examine the duplicates

duplicate_rows = df[duplicates]

# result is that no duplicate rows exist

print(duplicate_rows) 

Empty DataFrame
Columns: [rank, Youtuber, subscribers, video views, category, Title, uploads, Country, Abbreviation, channel_type, video_views_rank, country_rank, channel_type_rank, video_views_for_the_last_30_days, lowest_monthly_earnings, highest_monthly_earnings, lowest_yearly_earnings, highest_yearly_earnings, subscribers_for_last_30_days, created_year, created_month, created_date, Gross tertiary education enrollment (%), Population, Unemployment rate, Urban_population, Latitude, Longitude]
Index: []
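
Called without arguments, `duplicated()` only flags rows that match in every column. A channel listed twice with slightly different numbers would pass that check, so it may be worth checking a key column as well. A minimal sketch on made-up data (the `toy` frame is hypothetical, not taken from the dataset):

```python
import pandas as pd

# toy data: channel 'A' appears twice, but with different upload counts
toy = pd.DataFrame({
    'Youtuber': ['A', 'A', 'B'],
    'subscribers': [100, 100, 200],
    'uploads': [10, 12, 30],
})

# full-row check: finds nothing because 'uploads' differs
full_dupes = toy.duplicated().sum()

# key-column check: keep=False marks every member of a duplicate group
name_dupes = toy[toy.duplicated(subset=['Youtuber'], keep=False)]

print(full_dupes)
print(name_dupes)
```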


### Change data types

In [139]:
# fills missing values in specified columns of the df with 0, and updates the df in place

df.fillna({
    'video views': 0,
    'video_views_rank': 0,
    'country_rank': 0,
    'channel_type_rank': 0,
    'video_views_for_the_last_30_days': 0,
    'subscribers_for_last_30_days': 0,
    'created_year': 0,
    'Population': 0,
    'Urban_population': 0
}, inplace=True)


# converts the data types of specified columns in the DataFrame df to int64

df = df.astype({
    'video views': 'int64',
    'video_views_rank': 'int64',
    'country_rank': 'int64',
    'channel_type_rank': 'int64',
    'video_views_for_the_last_30_days': 'int64',
    'subscribers_for_last_30_days': 'int64',
    'created_year': 'int64',
    'Population': 'int64',
    'Urban_population': 'int64'
})
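
Filling with 0 makes the cast to int64 possible, but it also makes a missing rank or population indistinguishable from a real zero. As a hedged alternative (not what this notebook does), pandas' nullable `Int64` dtype keeps the gaps as `<NA>`. A minimal sketch on a toy frame with made-up rows:

```python
import numpy as np
import pandas as pd

# toy data (hypothetical rows) mimicking float columns that contain NaN
toy = pd.DataFrame({
    'country_rank': [1.0, np.nan, 5.0],
    'Population': [1366417754.0, np.nan, 328239523.0],
})

# nullable integer dtype: whole-number floats become integers, NaN becomes <NA>
toy_nullable = toy.astype({'country_rank': 'Int64', 'Population': 'Int64'})

print(toy_nullable.dtypes)
print(toy_nullable)
```

With this dtype, aggregations such as `mean()` skip the missing entries instead of counting them as zeros.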

In [140]:
# displays concise information of the df

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    int64  
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         995 non-n

### Identify missing values

In [141]:
# counts the number of missing values (NaN) in each column of df

missing_values_count = df.isnull().sum()
missing_values_count[missing_values_count > 0]

category                                    46
Country                                    122
Abbreviation                               122
channel_type                                30
created_month                                5
created_date                                 5
Gross tertiary education enrollment (%)    123
Unemployment rate                          123
Latitude                                   123
Longitude                                  123
dtype: int64
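
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on a made-up frame (the `toy` data and the name `missing_share` are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# toy data with gaps in one text column and one numeric column
toy = pd.DataFrame({
    'category': ['Music', None, 'Gaming', None],
    'Latitude': [20.59, np.nan, 37.09, 60.13],
    'uploads': [1, 2, 3, 4],
})

# mean of the Boolean mask is the fraction of missing values per column
missing_share = toy.isnull().mean().mul(100).round(1)

# keeps only columns with gaps, largest share first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)

print(missing_share)
```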

In [143]:
# replace missing values in categorical columns with "Unknown"
categorical_columns = df.select_dtypes(include=['object']).columns
df[categorical_columns] = df[categorical_columns].fillna("Unknown")

# replace missing values in numerical columns with 0
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

# exclude Latitude and Longitude columns
numerical_columns = numerical_columns.difference(['Latitude', 'Longitude']) 

df[numerical_columns] = df[numerical_columns].fillna(0)

# check if there are any missing values remaining in the dataset (Latitude and Longitude were left as NaN)
remaining_missing_values = df.isnull().sum().sum()
remaining_missing_values

246

In [144]:
df.head(20)

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,228000000000,Music,T-Series,20082,India,IN,Music,1,1,1,2258000000,564600.0,9000000.0,6800000.0,108400000.0,2000000,2006,Mar,13.0,28.1,1366417754,5.36,471031528,20.593684,78.96288
1,2,YouTube Movies,170000000,0,Film & Animation,youtubemovies,1,United States,US,Games,4055159,7670,7423,12,0.0,0.05,0.04,0.58,0,2006,Mar,5.0,88.2,328239523,14.7,270663028,37.09024,-95.712891
2,3,MrBeast,166000000,28368841870,Entertainment,MrBeast,741,United States,US,Entertainment,48,1,1,1348000000,337000.0,5400000.0,4000000.0,64700000.0,8000000,2012,Feb,20.0,88.2,328239523,14.7,270663028,37.09024,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,2,2,1,1975000000,493800.0,7900000.0,5900000.0,94800000.0,1000000,2006,Sep,1.0,88.2,328239523,14.7,270663028,37.09024,-95.712891
4,5,SET India,159000000,148000000000,Shows,SET India,116536,India,IN,Entertainment,3,2,2,1824000000,455900.0,7300000.0,5500000.0,87500000.0,1000000,2006,Sep,20.0,28.1,1366417754,5.36,471031528,20.593684,78.96288
5,6,Music,119000000,0,Unknown,Music,0,Unknown,Unknown,Music,4057944,0,0,0,0.0,0.0,0.0,0.0,0,2013,Sep,24.0,0.0,0,0.0,0,,
6,7,ýýý Kids Diana Show,112000000,93247040539,People & Blogs,ýýý Kids Diana Show,1111,United States,US,Entertainment,5,3,3,731674000,182900.0,2900000.0,2200000.0,35100000.0,0,2015,May,12.0,88.2,328239523,14.7,270663028,37.09024,-95.712891
7,8,PewDiePie,111000000,29058044447,Gaming,PewDiePie,4716,Japan,JP,Entertainment,44,1,4,39184000,9800.0,156700.0,117600.0,1900000.0,0,2010,Apr,29.0,63.2,126226568,2.29,115782416,36.204824,138.252924
8,9,Like Nastya,106000000,90479060027,People & Blogs,Like Nastya Vlog,493,Russia,RU,People,630,5,25,48947000,12200.0,195800.0,146800.0,2300000.0,100000,2016,Jan,14.0,81.9,144373535,4.59,107683889,61.52401,105.318756
9,10,Vlad and Niki,98900000,77180169894,Entertainment,Vlad and Niki,574,United States,US,Entertainment,8,5,6,580574000,145100.0,2300000.0,1700000.0,27900000.0,600000,2018,Apr,23.0,88.2,328239523,14.7,270663028,37.09024,-95.712891


### Remove unwanted rows

#### Remove channels which don't have any video views

In [154]:
# identifies channels with 0 video views; these are YouTube topic channels, not relevant for analysis, so they can be dropped

filtered_df = df[df['video views'] == 0]
filtered_df

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude


In [153]:
# removes rows from the df where 'video views' is 0
# (the column is int64, so the comparison must use the integer 0, not the string '0')

df.drop(df[df['video views'] == 0].index, inplace=True)
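
One thing to watch when filtering on `video views`: after the conversion to int64 the column holds integers, so a comparison against the string `'0'` matches no rows. A small sketch on made-up data (the `toy` frame is hypothetical) shows the difference, along with a vectorised filter that avoids looping over the index:

```python
import pandas as pd

# toy data: two channels with zero views, one with views
toy = pd.DataFrame({'Youtuber': ['A', 'B', 'C'], 'video views': [0, 150, 0]})

# integer column compared with a string: no row matches
string_matches = (toy['video views'] == '0').sum()

# integer column compared with an integer: both zero-view rows match
int_matches = (toy['video views'] == 0).sum()

# Boolean mask keeps only channels with at least one view
toy_with_views = toy[toy['video views'] != 0]

print(string_matches, int_matches)
print(toy_with_views)
```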

In [155]:
# reviews the results

filtered_df = df[df['video views'] == 0]
filtered_df

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude


In [162]:
df.head(15)

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,228000000000,Music,T-Series,20082,India,IN,Music,1,1,1,2258000000,564600.0,9000000.0,6800000.0,108400000.0,2000000,2006,Mar,13.0,28.1,1366417754,5.36,471031528,20.593684,78.96288
2,3,MrBeast,166000000,28368841870,Entertainment,MrBeast,741,United States,US,Entertainment,48,1,1,1348000000,337000.0,5400000.0,4000000.0,64700000.0,8000000,2012,Feb,20.0,88.2,328239523,14.7,270663028,37.09024,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,2,2,1,1975000000,493800.0,7900000.0,5900000.0,94800000.0,1000000,2006,Sep,1.0,88.2,328239523,14.7,270663028,37.09024,-95.712891
4,5,SET India,159000000,148000000000,Shows,SET India,116536,India,IN,Entertainment,3,2,2,1824000000,455900.0,7300000.0,5500000.0,87500000.0,1000000,2006,Sep,20.0,28.1,1366417754,5.36,471031528,20.593684,78.96288
6,7,Kids Diana Show,112000000,93247040539,People & Blogs,Kids Diana Show,1111,United States,US,Entertainment,5,3,3,731674000,182900.0,2900000.0,2200000.0,35100000.0,0,2015,May,12.0,88.2,328239523,14.7,270663028,37.09024,-95.712891
7,8,PewDiePie,111000000,29058044447,Gaming,PewDiePie,4716,Japan,JP,Entertainment,44,1,4,39184000,9800.0,156700.0,117600.0,1900000.0,0,2010,Apr,29.0,63.2,126226568,2.29,115782416,36.204824,138.252924
8,9,Like Nastya,106000000,90479060027,People & Blogs,Like Nastya Vlog,493,Russia,RU,People,630,5,25,48947000,12200.0,195800.0,146800.0,2300000.0,100000,2016,Jan,14.0,81.9,144373535,4.59,107683889,61.52401,105.318756
9,10,Vlad and Niki,98900000,77180169894,Entertainment,Vlad and Niki,574,United States,US,Entertainment,8,5,6,580574000,145100.0,2300000.0,1700000.0,27900000.0,600000,2018,Apr,23.0,88.2,328239523,14.7,270663028,37.09024,-95.712891
10,11,Zee Music Company,96700000,57856289381,Music,Zee Music Company,8548,India,IN,Music,12,3,2,803613000,200900.0,3200000.0,2400000.0,38600000.0,1100000,2014,Mar,12.0,28.1,1366417754,5.36,471031528,20.593684,78.96288
11,12,WWE,96000000,77428473662,Sports,WWE,70127,United States,US,Sports,7,6,1,714614000,178700.0,2900000.0,2100000.0,34300000.0,600000,2007,May,11.0,88.2,328239523,14.7,270663028,37.09024,-95.712891


### Clean & remove unwanted characters

In [157]:
# matches any character that is NOT: a letter (a-z, A-Z), a digit, a whitespace character,
# a standard punctuation mark (. , ! ?), an ampersand, an apostrophe, or a hyphen

pattern = r'[^a-zA-Z0-9\s.,!?&\'-]'
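
To see what this pattern flags, it can be tried on a few strings with the standard `re` module. The sample names below are a mix of values from the dataset and made-up ones ('Kids & Co.' and 'Café' are invented). Note that the character class only allows unaccented Latin letters, so legitimate accented characters are treated as unwanted too:

```python
import re

pattern = r'[^a-zA-Z0-9\s.,!?&\'-]'

samples = ['T-Series', 'Gibby :)', '_vector_', 'Kids & Co.', 'Café']

# strings containing at least one character outside the allowed set
flagged = [s for s in samples if re.search(pattern, s)]

# the same strings with those characters removed
cleaned = [re.sub(pattern, '', s) for s in samples]

print(flagged)
print(cleaned)
```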

In [158]:
# filters the 'Youtuber' column for rows which contain strings with unwanted characters
filtered_rows_youtuber = df[df['Youtuber'].str.contains(pattern, regex=True)]

# reviews the results
filtered_rows_youtuber

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
6,7,ýýý Kids Diana Show,112000000,93247040539,People & Blogs,ýýý Kids Diana Show,1111,United States,US,Entertainment,5,3,3,731674000,182900.0,2900000.0,2200000.0,35100000.0,0,2015,May,12.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
55,56,Vocï¿½ï¿½ï¿½,44700000,7828610828,Entertainment,Vocï¿½ï¿½ï¿½,1558,Brazil,BR,Entertainment,681,3,15,48032000,12000.0,192100.0,144100.0,2300000.0,100000,2013,Sep,1.0,51.3,212559417,12.08,183241641,-14.235004,-51.925280
64,65,ýýýýýýýý ýý ýýýýýýýýýýýýýý,43200000,36458726976,Film & Animation,ýýýýýýýý ýý ýýýýýýýýýýýýýý,1478,Russia,RU,Education,26,2,6,303780000,75900.0,1200000.0,911300.0,14600000.0,300000,2011,May,31.0,81.9,144373535,4.59,107683889,61.524010,105.318756
76,77,shfa2 - ï¿½ï¿½,39700000,23884824160,People & Blogs,shfa2 - ï¿½ï¿½,1596,United Arab Emirates,AE,People,81,1,2,247731000,61900.0,990900.0,743200.0,11900000.0,300000,2017,Nov,6.0,36.8,9770529,2.35,8479744,23.424076,53.847818
91,92,Vlad vï¿½ï¿½ï,37900000,23510152352,Unknown,Vlad vï¿½ï¿½ï,515,United States,US,Entertainment,84,28,27,244093000,61000.0,976400.0,732300.0,11700000.0,200000,2018,Jul,20.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
970,971,_vector_,12400000,7597013023,Comedy,_vector_,398,United States,US,Comedy,720,176,43,903672000,225900.0,3600000.0,2700000.0,43400000.0,1200000,2019,Mar,24.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
974,975,Gibby :),12400000,2862685032,People & Blogs,Gibby :),226,Mexico,MX,People,3087,34,63,10278000,2600.0,41100.0,30800.0,493300.0,0,2014,Aug,30.0,40.2,126014024,3.42,102626859,23.634501,-102.552784
975,976,Gustavo Parï¿½ï¿½,12400000,2602614088,Comedy,GustavoParodias,9,Brazil,BR,Comedy,4050768,5075,4894,0,0.0,0.0,0.0,0.0,0,2010,Aug,24.0,51.3,212559417,12.08,183241641,-14.235004,-51.925280
979,980,DaniRep | +6 Vï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï,12400000,6933660906,Gaming,DaniRep | +6 Vï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï,12419,Spain,ES,Games,847,19,68,82648000,20700.0,330600.0,247900.0,4000000.0,100000,2012,Oct,29.0,88.9,47076781,13.96,37927409,40.463667,-3.749220


In [159]:
# filters the 'Title' column for rows which contain strings with unwanted characters

filtered_rows_title = df[df['Title'].str.contains(pattern, regex=True)]

# reviews the results

filtered_rows_title

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
6,7,ýýý Kids Diana Show,112000000,93247040539,People & Blogs,ýýý Kids Diana Show,1111,United States,US,Entertainment,5,3,3,731674000,182900.00,2900000.00,2200000.00,35100000.0,0,2015,May,12.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
38,39,LooLoo Kids - Nursery Rhymes and Children's Songs,54000000,32312431239,Music,LooLoo Kids - Nursery Rhymes and Children's ï¿½,11,Unknown,Unknown,Unknown,3800129,0,0,159,0.04,0.64,0.48,8.0,0,2016,Nov,29.0,0.0,0,0.00,0,,
43,44,BillionSurpriseToys - Nursery Rhymes & Cartoons,52200000,9877365274,Education,BillionSurpriseToys - Nursery Rhymes & Cartï¿½,847,United States,US,Education,450,15,5,266747000,66700.00,1100000.00,800200.00,12800000.0,600000,2013,Oct,25.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
55,56,Vocï¿½ï¿½ï¿½,44700000,7828610828,Entertainment,Vocï¿½ï¿½ï¿½,1558,Brazil,BR,Entertainment,681,3,15,48032000,12000.00,192100.00,144100.00,2300000.0,100000,2013,Sep,1.0,51.3,212559417,12.08,183241641,-14.235004,-51.925280
64,65,ýýýýýýýý ýý ýýýýýýýýýýýýýý,43200000,36458726976,Film & Animation,ýýýýýýýý ýý ýýýýýýýýýýýýýý,1478,Russia,RU,Education,26,2,6,303780000,75900.00,1200000.00,911300.00,14600000.0,300000,2011,May,31.0,81.9,144373535,4.59,107683889,61.524010,105.318756
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
969,970,Quantum Tech HD,12500000,4340213066,Science & Technology,Mr_Mughall Gaming,223,Pakistan,PK,Unknown,3956586,3554,0,8721,2.00,35.00,26.00,419.0,32,2022,Apr,23.0,9.0,216565318,4.45,79927762,30.375321,69.345116
970,971,_vector_,12400000,7597013023,Comedy,_vector_,398,United States,US,Comedy,720,176,43,903672000,225900.00,3600000.00,2700000.00,43400000.0,1200000,2019,Mar,24.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
974,975,Gibby :),12400000,2862685032,People & Blogs,Gibby :),226,Mexico,MX,People,3087,34,63,10278000,2600.00,41100.00,30800.00,493300.0,0,2014,Aug,30.0,40.2,126014024,3.42,102626859,23.634501,-102.552784
979,980,DaniRep | +6 Vï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï,12400000,6933660906,Gaming,DaniRep | +6 Vï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï¿½ï,12419,Spain,ES,Games,847,19,68,82648000,20700.00,330600.00,247900.00,4000000.0,100000,2012,Oct,29.0,88.9,47076781,13.96,37927409,40.463667,-3.749220


In [160]:
# replaces every character matched by the pattern (anything other than letters, digits, whitespace
# and . , ! ? & ' -) in the 'Youtuber' & 'Title' columns of the df with an empty string;
# regex=True is passed explicitly because str.replace treats the pattern as a literal string by default in pandas 2.0+

df['Youtuber'] = df['Youtuber'].str.replace(pattern, '', regex=True)
df['Title'] = df['Title'].str.replace(pattern, '', regex=True)

df



Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,228000000000,Music,T-Series,20082,India,IN,Music,1,1,1,2258000000,564600.0,9000000.0,6800000.0,108400000.0,2000000,2006,Mar,13.0,28.1,1366417754,5.36,471031528,20.593684,78.962880
2,3,MrBeast,166000000,28368841870,Entertainment,MrBeast,741,United States,US,Entertainment,48,1,1,1348000000,337000.0,5400000.0,4000000.0,64700000.0,8000000,2012,Feb,20.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,2,2,1,1975000000,493800.0,7900000.0,5900000.0,94800000.0,1000000,2006,Sep,1.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
4,5,SET India,159000000,148000000000,Shows,SET India,116536,India,IN,Entertainment,3,2,2,1824000000,455900.0,7300000.0,5500000.0,87500000.0,1000000,2006,Sep,20.0,28.1,1366417754,5.36,471031528,20.593684,78.962880
6,7,Kids Diana Show,112000000,93247040539,People & Blogs,Kids Diana Show,1111,United States,US,Entertainment,5,3,3,731674000,182900.0,2900000.0,2200000.0,35100000.0,0,2015,May,12.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
990,991,Natan por A,12300000,9029609749,Sports,Natan por A,1200,Brazil,BR,Entertainment,525,55,172,552513000,138100.0,2200000.0,1700000.0,26500000.0,700000,2017,Feb,12.0,51.3,212559417,12.08,183241641,-14.235004,-51.925280
991,992,Free Fire India Official,12300000,1674409945,People & Blogs,Free Fire India Official,1500,India,IN,Games,6141,125,69,64735000,16200.0,258900.0,194200.0,3100000.0,300000,2018,Sep,14.0,28.1,1366417754,5.36,471031528,20.593684,78.962880
992,993,Panda,12300000,2214684303,Unknown,HybridPanda,2452,United Kingdom,GB,Games,129005,867,1202,67035,17.0,268.0,201.0,3200.0,1000,2006,Sep,11.0,60.0,66834405,3.85,55908316,55.378051,-3.435973
993,994,RobTopGames,12300000,374123483,Gaming,RobTopGames,39,Sweden,SE,Games,35112,4,69,3871000,968.0,15500.0,11600.0,185800.0,100000,2012,May,9.0,67.0,10285453,6.48,9021165,60.128161,18.643501
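
The effect of the `regex` flag in `str.replace` can be checked on a small Series. This is a sketch on three sample names; the extra `.str.strip()` is an assumption added here to remove the whitespace left behind once leading or trailing characters are deleted, and is not part of the cell above:

```python
import pandas as pd

pattern = r'[^a-zA-Z0-9\s.,!?&\'-]'
names = pd.Series(['ýýý Kids Diana Show', 'Gibby :)', 'MrBeast'])

# regex=False: the pattern is searched for as literal text, so nothing changes
literal = names.str.replace(pattern, '', regex=False)

# regex=True: unwanted characters are removed, then stray whitespace is trimmed
cleaned = names.str.replace(pattern, '', regex=True).str.strip()

print(literal.tolist())
print(cleaned.tolist())
```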


### Setting the index

In [163]:
# sets the df index to 'rank' as it's the natural index of this dataset

df.set_index('rank', inplace = True)

df

rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
1,T-Series,245000000,228000000000,Music,T-Series,20082,India,IN,Music,1,1,1,2258000000,564600.0,9000000.0,6800000.0,108400000.0,2000000,2006,Mar,13.0,28.1,1366417754,5.36,471031528,20.593684,78.962880
3,MrBeast,166000000,28368841870,Entertainment,MrBeast,741,United States,US,Entertainment,48,1,1,1348000000,337000.0,5400000.0,4000000.0,64700000.0,8000000,2012,Feb,20.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
4,Cocomelon - Nursery Rhymes,162000000,164000000000,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,2,2,1,1975000000,493800.0,7900000.0,5900000.0,94800000.0,1000000,2006,Sep,1.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
5,SET India,159000000,148000000000,Shows,SET India,116536,India,IN,Entertainment,3,2,2,1824000000,455900.0,7300000.0,5500000.0,87500000.0,1000000,2006,Sep,20.0,28.1,1366417754,5.36,471031528,20.593684,78.962880
7,Kids Diana Show,112000000,93247040539,People & Blogs,Kids Diana Show,1111,United States,US,Entertainment,5,3,3,731674000,182900.0,2900000.0,2200000.0,35100000.0,0,2015,May,12.0,88.2,328239523,14.70,270663028,37.090240,-95.712891
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
991,Natan por A,12300000,9029609749,Sports,Natan por A,1200,Brazil,BR,Entertainment,525,55,172,552513000,138100.0,2200000.0,1700000.0,26500000.0,700000,2017,Feb,12.0,51.3,212559417,12.08,183241641,-14.235004,-51.925280
992,Free Fire India Official,12300000,1674409945,People & Blogs,Free Fire India Official,1500,India,IN,Games,6141,125,69,64735000,16200.0,258900.0,194200.0,3100000.0,300000,2018,Sep,14.0,28.1,1366417754,5.36,471031528,20.593684,78.962880
993,Panda,12300000,2214684303,Unknown,HybridPanda,2452,United Kingdom,GB,Games,129005,867,1202,67035,17.0,268.0,201.0,3200.0,1000,2006,Sep,11.0,60.0,66834405,3.85,55908316,55.378051,-3.435973
994,RobTopGames,12300000,374123483,Gaming,RobTopGames,39,Sweden,SE,Games,35112,4,69,3871000,968.0,15500.0,11600.0,185800.0,100000,2012,May,9.0,67.0,10285453,6.48,9021165,60.128161,18.643501
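
Because the ranks in the output above have gaps (for example, rank 2 is absent), lookups on the new index are by label, not by position. The following is a minimal sketch, using a toy frame built from a few of the visible rows, of a non-mutating alternative to `inplace=True` that also checks the ranks are unique. The name `toy` and the choice of `verify_integrity=True` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy frame built from a few rows visible in the output above (illustrative only).
toy = pd.DataFrame({
    "rank": [1, 3, 4, 5, 7],
    "Youtuber": ["T-Series", "MrBeast", "Cocomelon - Nursery Rhymes",
                 "SET India", "Kids Diana Show"],
    "subscribers": [245000000, 166000000, 162000000, 159000000, 112000000],
})

# Returns a new frame instead of mutating `toy`;
# verify_integrity=True raises ValueError if any rank is duplicated.
indexed = toy.set_index("rank", verify_integrity=True)

# Label-based lookup: rank 3 exists, rank 2 does not appear in the data.
name_at_rank_3 = indexed.loc[3, "Youtuber"]
has_rank_2 = 2 in indexed.index
```

Assigning the result to a new name keeps the original frame available if a later cell needs `rank` as an ordinary column.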


## Summary statistics

In [165]:
# Generate summary statistics for numerical columns

numerical_summary = df.describe()
numerical_summary

,subscribers,video views,uploads,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,lowest_yearly_earnings,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
count,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,987.0,870.0,870.0
mean,22601110.0,11129020000.0,9261.583587,525492.1,329.690983,708.907801,167070000.0,37185.124154,594588.5,445842.1,7139215.0,232415.5,2002.439716,15.669706,56.035461,379575700.0,8.168318,197542200.0,26.608743,-13.940595
std,16436700.0,14132750000.0,34279.551177,1331031.0,1129.903345,1901.329135,407871400.0,72072.487477,1152039.0,863777.3,13838080.0,527874.1,143.029136,8.83109,31.992958,465658400.0,5.480071,162438300.0,20.578051,84.767773
min,12300000.0,2634.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-38.416097,-172.104629
25%,14500000.0,4344888000.0,205.5,311.0,5.5,24.0,14358500.0,2900.0,45850.0,34400.0,549750.0,0.0,2009.0,8.0,28.1,51709100.0,3.85,40827300.0,20.593684,-95.712891
50%,17700000.0,7776706000.0,741.0,906.0,34.0,62.0,57809000.0,13500.0,216500.0,162400.0,2600000.0,100000.0,2013.0,16.0,60.0,270203900.0,5.36,183241600.0,37.09024,-51.92528
75%,24300000.0,13588020000.0,2725.5,3348.0,114.0,137.0,159986500.0,38250.0,612200.0,459150.0,7350000.0,200000.0,2016.0,23.0,88.2,328239500.0,14.7,270663000.0,37.09024,78.96288
max,245000000.0,228000000000.0,301308.0,4057944.0,7741.0,7741.0,6589000000.0,850900.0,13600000.0,10200000.0,163400000.0,8000000.0,2022.0,31.0,113.1,1397715000.0,14.72,842934000.0,61.92411,138.252924
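
The `min` row above is 0 for `created_year`, `Population`, and most other columns, and `created_year` has a standard deviation of about 143 years. This suggests that zeros stand in for missing values somewhere upstream. A minimal sketch of one way to handle this, assuming the zeros really are placeholders, is to mask them before calling `describe()`. The toy frame and the `placeholder_cols` list are hypothetical and would need to be adapted to the real columns.

```python
import pandas as pd

# Toy data: the third row uses 0 as a stand-in for "missing" (assumption).
toy = pd.DataFrame({
    "created_year": [2006, 2012, 0, 2015],
    "Population": [1366417754, 328239523, 0, 328239523],
    "uploads": [20082, 741, 0, 1111],  # here 0 can be a genuine value
})

# Hypothetical list of columns in which 0 cannot be a real observation.
placeholder_cols = ["created_year", "Population"]

cleaned = toy.copy()
for col in placeholder_cols:
    # mask() replaces entries where the condition holds with NaN
    cleaned[col] = cleaned[col].mask(cleaned[col] == 0)

# describe() skips NaN, so count, mean, std, and min now ignore the placeholders.
summary = cleaned.describe()
```

Columns such as `uploads`, where zero is a plausible value, are deliberately left untouched in this sketch.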


In [167]:
# Generate summary statistics for categorical columns

# count: The number of non-null entries in each column
# unique: The number of unique categories in each column
# top: The most frequent category in each column
# freq: The frequency of the most frequent category in each column

categorical_summary = df[categorical_columns].describe(include=['object'])
categorical_summary

,Youtuber,category,Title,Country,Abbreviation,channel_type,created_month
count,987.0,987,987.0,987,987,987,987
unique,973.0,19,970.0,50,50,15,13
top,,Entertainment,,United States,US,Entertainment,Jan
freq,6.0,241,6.0,311,311,303,99
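
`created_month` reports 13 unique values although a year has only 12 months, so one of them is probably a placeholder. The sketch below shows one way to select the non-numeric columns without a hand-maintained list and to surface unexpected month labels. The toy frame is illustrative, and the `"nan"` string in it is a hypothetical placeholder; the actual extra value is not visible in this output.

```python
import pandas as pd

# Toy data; "nan" is a hypothetical placeholder string, not taken from the dataset.
toy = pd.DataFrame({
    "Youtuber": ["T-Series", "MrBeast", "SET India", "RobTopGames"],
    "created_month": ["Mar", "Feb", "Sep", "nan"],
    "uploads": [20082, 741, 116536, 39],
})

# Pick every non-numeric column instead of listing the names by hand.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
categorical_summary = toy[text_cols].describe()

# Any label outside the twelve month abbreviations is worth a closer look.
valid_months = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}
unexpected_months = sorted(set(toy["created_month"]) - valid_months)
```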


In [None]:
# Do some YouTubers have more than one channel? (973 unique names across 987 rows)
# The top YouTubers come from only 50 countries
# What is the difference between category and channel_type?
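
Two of these questions can be probed directly with pandas. The following is a rough sketch on a toy frame: `value_counts()` lists names that occur more than once, and `pd.crosstab` shows how `category` and `channel_type` line up. The second "Panda" row is invented for illustration and does not come from the dataset.

```python
import pandas as pd

# Toy data; the last row is a made-up duplicate name for demonstration.
toy = pd.DataFrame({
    "Youtuber": ["T-Series", "MrBeast", "SET India", "Panda", "Panda"],
    "category": ["Music", "Entertainment", "Shows", "Unknown", "Gaming"],
    "channel_type": ["Music", "Entertainment", "Entertainment", "Games", "Games"],
})

# Names that appear on more than one row.
name_counts = toy["Youtuber"].value_counts()
repeated_names = name_counts[name_counts > 1]

# Rows are category values, columns are channel_type values, cells are row counts.
overlap = pd.crosstab(toy["category"], toy["channel_type"])
```

A repeated name does not prove that one creator owns several channels; different channels can share a name, so the matching rows still need to be inspected.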