# **EDA - Top 8800 Twitch Streamers**

### **Description**

This dataset contains the Top 8800 Twitch Streamers compiled by [GiRLaZo](https://www.twitch.tv/girlazo).

Legend:
- **profile picture** - Profile picture link
- **top count** - Ranking of the 8,800 top streamers in descending order
- **screen name** - Streamer's screen name
- **watch time** - Total time viewers have spent watching the streamer, in minutes
- **stream time** - Broadcast time of a streamer in minutes
- **peak viewers** - Maximum viewers
- **average viewers** - Average viewers
- **followers gained** - Followers gained in the last 365 days
- **views gained** - Views gained in the last 365 days
- **partnered** - Whether the streamer is a Twitch partner
- **mature** - Whether the content is intended for viewers aged 18+
- **language** - The language of the streamer
- **complete name** - Streamer name
- **first category** - The main category where the streamer broadcasts
- **second category** - The second category where the streamer broadcasts
- **third category** - The third category where the streamer broadcasts

**All categories cover a 365-day interval**


### **Objective**

- Summarize the data with descriptive statistics.
- Create a dashboard that implements EDA and machine learning.
- Obtain the data from the Twitch API.
- Run ETL with PySpark (data lake).
- Build a regression model that tries to recommend the category or game that yields the most followers or views.
- Mongo + Cloud + AWS S3
- Kubeflow
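
As a rough starting point for the recommendation objective, one could compare categories by the median number of followers gained before fitting any regression model. The sketch below runs on a small made-up frame: the sample values and the helper `rank_categories` are hypothetical, and only the column names `first category` and `followers gained` come from the legend above.

```python
import pandas as pd


def rank_categories(frame, target="followers gained", min_streamers=2):
    """Rank first categories by the median of `target` (hypothetical helper)."""
    stats = frame.groupby("first category")[target].agg(["median", "count"])
    # Ignore categories with too few streamers to say anything meaningful.
    stats = stats[stats["count"] >= min_streamers]
    return stats.sort_values("median", ascending=False)


# Made-up sample values, for illustration only.
sample = pd.DataFrame({
    "first category": ["Category A", "Category A", "Category B", "Category B", "Category C"],
    "followers gained": [1000, 3000, 500, 700, 9000],
})

ranking = rank_categories(sample)
print(ranking)
```

The median is used here instead of the mean so that a single very large streamer does not dominate a category; this is a design assumption, not something the dataset prescribes.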


### **Outline**

- Descriptive analysis

### Importing libraries

In [None]:
!pip install dash==1.19.0

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
#import statsmodels.api as sm 
import seaborn as sns
import plotly.express as px
import dash
import dash_core_components as dcc
import dash_html_components as html
import plotly.graph_objects as go
from dash.dependencies import Input, Output
%matplotlib inline

### Data loading and overview

In [None]:
df = pd.read_csv('../input/top-8800-twitch-streamers/TwitchDataSet.csv')
df["stream time"] = df["stream time"] // 60  # convert minutes to whole hours
df["watch time"] = df["watch time"] // 60  # convert minutes to whole hours
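
Note that `// 60` is floor division, so any partial hour is discarded. A minimal sketch on made-up minute values shows how this differs from true division (the sample numbers are hypothetical):

```python
import pandas as pd

# Made-up durations in minutes, for illustration only.
minutes = pd.Series([59, 60, 125], name="stream time")

floored_hours = minutes // 60  # whole hours only; the remainder is dropped
exact_hours = minutes / 60     # fractional hours

print(floored_hours.tolist())
print(exact_hours.round(2).tolist())
```

For totals in the thousands of hours the lost fraction is negligible, but for short durations true division may be the better choice.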

In [None]:
df.head()

In [None]:
df.info()

In [None]:
cv = [col for col in df.columns if df[col].dtype == 'O']
nv = [col for col in df.columns if df[col].dtype != 'O']
print("{} categorical variables: \n{} \n\n {} numeric variables: \n{}"\
      .format(len(cv),cv, len(nv),nv))
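
The split above treats every non-object column as numeric, which would also include boolean columns. A possible alternative is `DataFrame.select_dtypes`, sketched here on a made-up frame (the `toy` data is hypothetical, and the text column is forced to `object` dtype so the example behaves the same across pandas versions):

```python
import pandas as pd

# Made-up frame with one text, one numeric and one boolean column.
toy = pd.DataFrame({
    "screen name": pd.Series(["a", "b"], dtype=object),
    "followers": [10, 20],
    "partnered": [True, False],
})

categorical_cols = toy.select_dtypes(include="object").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
boolean_cols = toy.select_dtypes(include="bool").columns.tolist()

print(categorical_cols, numeric_cols, boolean_cols)
```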

In [None]:
print('NaN: \n\n{}'.format(df.isnull().sum()))

In [None]:
# Drop columns that are not needed for the analysis
df = df.drop(columns=["profile picture", "completa name"])
df['language'] = df['language'].fillna('')  # replace NaN with an empty string
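
A minimal sketch of the same clean-up on a made-up frame, counting missing values before and after filling them. The `"Unknown"` label is an assumption made for this example; the notebook itself fills with an empty string.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing language, for illustration only.
toy = pd.DataFrame({"language": ["English", np.nan, "Spanish"]})

missing_before = int(toy["language"].isnull().sum())
toy["language"] = toy["language"].fillna("Unknown")  # hypothetical label
missing_after = int(toy["language"].isnull().sum())

print(missing_before, missing_after)
```

An explicit label keeps those rows identifiable in plot legends, whereas an empty string shows up as a blank entry.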

In [None]:
fig = go.Figure(data=[go.Table(
    header=dict(values=list(df.columns),
                fill_color='paleturquoise',
                align='left'),
    # Build the cells from df.columns so they always line up with the header
    cells=dict(values=[df[col] for col in df.columns],
               fill_color='lavender',
               align='left'))
])

fig.show()

In [None]:
df.describe().transpose()

## Correlation

In [None]:
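# Correlate only numeric and boolean columns; text columns cannot be correlated
plt.figure(figsize=(10,10))
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr(), annot=True)
plt.show()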
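
Two points may be worth keeping in mind for the correlation matrix. In pandas 2.0 and later, `DataFrame.corr` raises an error when the frame contains text columns, so they need to be filtered out first. Also, the default Pearson coefficient only measures linear association, and a rank-based coefficient such as Spearman can be more informative if the columns are strongly skewed. The sketch below uses a made-up frame (all values are hypothetical):

```python
import pandas as pd

# Made-up frame: one text column and two numeric columns with an outlier.
toy = pd.DataFrame({
    "screen name": pd.Series(["a", "b", "c", "d"], dtype=object),
    "followers": [100, 200, 300, 10000],
    "peak viewers": [10, 20, 30, 45],
})

# Keep only numeric columns before computing correlations.
numeric = toy.select_dtypes(include="number")

pearson = numeric.corr()                    # linear association
spearman = numeric.corr(method="spearman")  # rank-based association

print(pearson.round(3))
print(spearman.round(3))
```

Here the two columns increase together, so the Spearman coefficient is 1 while the Pearson coefficient is pulled down by the outlier.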

## Time

In [None]:
df.loc[:,['watch time']].describe().transpose()

In [None]:
import plotly.graph_objects as go

table_time = df[['top count','screen name','watch time','stream time']]

fig = go.Figure(data=[go.Table(
    header=dict(values=list(table_time.columns),
                fill_color='paleturquoise',
                align='left'),
    cells=dict(values=[table_time['top count'], table_time['screen name'], table_time['watch time'], table_time['stream time']],
               fill_color='lavender',
               align='left'))
])

fig.show()

In [None]:
fig = px.bar(df, x=df['screen name'].iloc[0:50], y=df['watch time'].iloc[0:50], \
             title='Top 50 streamers with the most viewing hours')
fig.update_xaxes(
        title_text = "Streamers",
        title_font = {"size": 15})
fig.update_yaxes(
        title_text = "Hours")
fig.show()

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

labels = df['screen name'].iloc[0:20]

fig = make_subplots(1, 2, specs=[[{'type':'domain'}, {'type':'domain'}]],
                    subplot_titles=['Watch time', 'Stream time'])
fig.add_trace(go.Pie(labels=labels, values=df['watch time'].iloc[0:20], scalegroup='one',
                     name="Viewing hours"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=df['stream time'].iloc[0:20], scalegroup='two',
                     name="Hours of stream"), 1, 2)

fig.update_layout(title_text='Watch hours vs. stream hours for the Top 20 streamers')
fig.show()

In [None]:
tab_st = df.sort_values(by=['stream time'],ascending=False)
fig = px.bar(tab_st, x=tab_st['screen name'].iloc[0:100], y=tab_st['stream time'].iloc[0:100], \
             title='Top 100 streamers with the most streaming hours')
fig.update_xaxes(
        title_text = "Streamers",
        title_font = {"size": 15})
fig.update_yaxes(
        title_text = "Hours")
fig.show()

In [None]:
# Take scalar values (not one-element Series) so the message prints cleanly
name = tab_st['screen name'].iloc[0]
hours = tab_st['stream time'].iloc[0]
days = hours // 24
print('The streamer with the most streaming hours is {} with {} hours ({} days).'.format(name, hours, days))
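
The integer division by 24 above drops the leftover hours. If both parts are wanted, `divmod` returns them together. A minimal sketch, where `hours_to_days` is a hypothetical helper and 8000 is a made-up value:

```python
def hours_to_days(total_hours):
    """Split a number of hours into whole days and leftover hours."""
    days, hours = divmod(int(total_hours), 24)
    return days, hours


days, hours = hours_to_days(8000)  # 8000 is a made-up value
print(f"{days} days and {hours} hours")
```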

### Category

In [None]:
print("We have {} categories".format(df['first category'].nunique()))

In [None]:
tab_fc = pd.crosstab(index=df['first category'], columns='top count')
tab_fc = tab_fc.sort_values(by=['top count'],ascending=False)
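
The crosstab above is a frequency table of the first category. A possible shorter equivalent is `Series.value_counts`, which already returns the counts sorted in descending order. The sketch below compares both on a made-up column (the `toy` values are hypothetical):

```python
import pandas as pd

# Made-up categories, for illustration only.
toy = pd.DataFrame({
    "first category": pd.Series(["A", "B", "A", "C", "A", "B"], dtype=object),
})

# Frequency table built the same way as in the cell above.
via_crosstab = pd.crosstab(index=toy["first category"], columns="top count")
via_crosstab = via_crosstab.sort_values(by=["top count"], ascending=False)

# Same counts with value_counts.
via_value_counts = toy["first category"].value_counts()

print(via_crosstab)
print(via_value_counts)
```

The crosstab version returns a DataFrame with a named count column, which is convenient for `px.bar`; `value_counts` returns a Series.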

In [None]:
fig = px.bar(tab_fc.iloc[0:20], x=tab_fc.iloc[0:20].index, y='top count', \
             title='Top 20 categories most chosen by the Top 8800 streamers')
fig.update_xaxes(
        #tickangle = 90,
        title_text = "Category",
        title_font = {"size": 15})


fig.update_yaxes(
        title_text = "Top streamers",)
        #title_standoff = 10)
fig.show()


In [None]:
tab_fc2 = pd.crosstab(index=df['second category'], columns='top count')
tab_fc2 = tab_fc2.sort_values(by=['top count'],ascending=False)

In [None]:
fig = px.bar(tab_fc2.iloc[0:20], x=tab_fc2.iloc[0:20].index, y='top count', \
             title='Top 20 most chosen second categories among the Top 8800 streamers')
fig.update_xaxes(
        #tickangle = 90,
        title_text = "Category",
        title_font = {"size": 15})


fig.update_yaxes(
        title_text = "Top streamers",)
        #title_standoff = 10)
fig.show()

In [None]:
tab_fc3 = pd.crosstab(index=df['third category'], columns='top count')
tab_fc3 = tab_fc3.sort_values(by=['top count'],ascending=False)

In [None]:
fig = px.bar(tab_fc3.iloc[0:20], x=tab_fc3.iloc[0:20].index, y='top count', \
             title='Top 20 most chosen third categories among the Top 8800 streamers')
fig.update_xaxes(
        #tickangle = 90,
        title_text = "Category",
        title_font = {"size": 15})


fig.update_yaxes(
        title_text = "Top streamers",)
        #title_standoff = 10)
fig.show()

## Followers and languages

In [None]:
df.loc[:,['followers', 'peak viewers']].describe().transpose()

In [None]:
import plotly.express as px
fig = px.scatter(df, x="top count", y="top count",
         size="followers", color="language",
                 hover_name="screen name", log_x=True, size_max=50, \
                 title='Top streamers by followers and languages')
fig.update_yaxes(autorange="reversed")
fig.update_xaxes(
        #tickangle = 90,
        title_text = "Top streamers from 1-8800",
        title_font = {"size": 15})


fig.update_yaxes(
        title_text = "8800-1 Top streamers ",)
        #title_standoff = 10)
fig.show()

In [None]:
import plotly.express as px
fig = px.scatter(df, x="top count", y="peak viewers",
         size="peak viewers", color="language",
                 hover_name="screen name", log_x=True, size_max=50, \
                 title='Peak viewers in the Top 8800')
fig.update_yaxes(autorange="reversed")
fig.update_xaxes(
        #tickangle = 90,
        title_text = "Streamers",
        title_font = {"size": 15})


fig.update_yaxes(
        title_text = "Peak viewers",)
        #title_standoff = 10)
fig.show()

In [None]:
import plotly.express as px

tab_followers50 = df.sort_values(by=['followers'],ascending=False)

# Plot the 50 streamers with the most followers
fig = px.bar(tab_followers50.iloc[0:50], x='screen name', y='followers',
             title='Top 50 streamers with the most followers', opacity=1)
fig.update_xaxes(
        #tickangle = 90,
        title_text = "Streamers ",
        title_font = {"size": 15})


fig.update_yaxes(
        title_text = "Followers",)
        #title_standoff = 10)
fig.show()

In [None]:
import plotly.graph_objects as go

names = tab_followers50['screen name'].iloc[0:50]  # x-axis labels for the top 50

fig = go.Figure()
fig.add_trace(go.Bar(
    x=names,
    y=tab_followers50['followers'].iloc[0:50],
    name='Followers',
    marker_color='indianred',
))
fig.add_trace(go.Bar(
    x=names,
    y=tab_followers50['followers gained'].iloc[0:50],
    name='Followers gained in the last year',
    marker_color='lightsalmon'
))
fig.update_xaxes(
        #tickangle = 90,
        title_text = "Top 50 Streamers ",
        title_font = {"size": 15})


fig.update_yaxes(
        title_text = "Followers",)
        #title_standoff = 10)
# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.show()

In [None]:
tab_followers50_gained = df.sort_values(by=['followers gained'],ascending=False)

# Plot the 50 streamers who gained the most followers
fig = px.bar(tab_followers50_gained.iloc[0:50], x='screen name', y='followers gained',
             title='Top 50 streamers who have gained the most followers in the last year',
             opacity=1)
fig.update_xaxes(
        #tickangle = 90,
        title_text = "Top 50 Streamers ",
        title_font = {"size": 15})


fig.update_yaxes(
        title_text = "Followers",)
        #title_standoff = 10)
fig.show()

In [None]:
tab_followers50_viewers = df.sort_values(by=['views gained'],ascending=False)

names = tab_followers50_viewers['screen name'].iloc[0:50]  # x-axis labels for the top 50

fig = go.Figure()
fig.add_trace(go.Bar(
    x=names,
    y=tab_followers50_viewers['views gained'].iloc[0:50],
    name='Views gained in the last year',
    marker_color='indianred',
))
fig.add_trace(go.Bar(
    x=names,
    y=tab_followers50_viewers['average viewers'].iloc[0:50],
    name='Average viewers',
    marker_color='lightsalmon'
))
fig.update_xaxes(
        #tickangle = 90,
        title_text = "Top 50 Streamers ",
        title_font = {"size": 15})


fig.update_yaxes(
        title_text = "Views",)
        #title_standoff = 10)
# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.show()

## Partners VS Mature Content

In [None]:
import plotly.express as px

fig = px.histogram(df, x="partnered", color="mature", title='Partnered streamers, split by mature content')
fig.show()

fig = px.histogram(df, x="mature", color="partnered", title='Mature-content streams, split by partner status')
fig.show()
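
The counts behind these two histograms can also be read from a contingency table. A minimal sketch with `pd.crosstab` on a made-up frame follows; the values are hypothetical, and the boolean encoding of `partnered` and `mature` is an assumption about how the CSV stores them.

```python
import pandas as pd

# Made-up flags, for illustration only.
toy = pd.DataFrame({
    "partnered": [True, True, False, False, True],
    "mature": [False, True, False, False, False],
})

# Absolute counts: rows are partner status, columns are the mature flag.
counts = pd.crosstab(toy["partnered"], toy["mature"])

# Share of mature and non-mature streams within each partner group.
shares = pd.crosstab(toy["partnered"], toy["mature"], normalize="index")

print(counts)
print(shares.round(2))
```

Row-normalised shares make the two groups comparable even when one of them is much larger than the other.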

In [None]:
fig = px.scatter(df, x="top count", y="stream time",
         size="watch time", color="mature",
                 hover_name="screen name", log_x=True, size_max=50, \
                 title='Stream time by rank in the Top 8800 (marker size: watch time)')
fig.update_yaxes(autorange="reversed")
fig.update_xaxes(
        #tickangle = 90,
        title_text = "Streamers",
        title_font = {"size": 15})


fig.update_yaxes(
        title_text = "Stream time (hours)",)
        #title_standoff = 10)
fig.show()