In [None]:
import numpy as np
import pandas as pd 
import os


from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings('ignore')

## How does your Kaggle ranking, or simply being on Kaggle for a long time, affect popularity on the CTDS show on YouTube?


* Assumption: a high count of YouTube views / YouTube watch hours can serve as a proxy for popularity on the show
* You can extend this analysis to other podcast media as well


Dataset - Episodes.csv <br>
External Sources : Meta Kaggle Dataset

Background for this kernel : https://www.kaggle.com/rohanrao/chai-time-data-science/discussion/166575

### Hypothesis, approach & Steps 
* We look at the heroes who have a Kaggle account
* Get their ranks for Competitions, Scripts and Discussions
* We also gather how long they have been on Kaggle (oldies and newcomers?)
* See whether any of these factors determine their popularity on YouTube
* Scale the features
* Build a pool of potential candidates from Kaggle (based on the above factors)
* Pick a few top heroes (from CTDS) who had good YouTube viewership
* Compute a cosine distance matrix comparing the feature vectors of the heroes with those of the entire Kaggle pool (for the least-cosine-distance calculation, see https://www.kaggle.com/tomtillo/cosine-distance-between-top-heroes)
* Sort and take the smallest distances
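
The last steps of the approach (cosine distance matrix, then sort) can be sketched as follows. This is only a minimal illustration, not the code from the linked kernel: the feature vectors, the user names and the helper `cosine_distance_matrix` are all made up for the example.

```python
import numpy as np

def cosine_distance_matrix(heroes, pool):
    """Return an (n_heroes, n_pool) matrix of cosine distances (1 - cosine similarity)."""
    heroes = np.asarray(heroes, dtype=float)
    pool = np.asarray(pool, dtype=float)
    # Normalise each row to unit length; all-zero rows are left as zeros.
    h_norm = np.linalg.norm(heroes, axis=1, keepdims=True)
    p_norm = np.linalg.norm(pool, axis=1, keepdims=True)
    h_unit = heroes / np.where(h_norm == 0, 1.0, h_norm)
    p_unit = pool / np.where(p_norm == 0, 1.0, p_norm)
    return 1.0 - h_unit @ p_unit.T

# Hypothetical scaled features: [months_in_kaggle, high_comp, high_scripts, high_disc]
top_heroes = np.array([[0.9, 0.05, 0.10, 0.02],
                       [0.7, 0.01, 0.30, 0.05]])
pool_names = ["user_a", "user_b", "user_c"]
pool = np.array([[0.9, 0.05, 0.10, 0.02],   # same direction as the first hero
                 [0.1, 0.90, 0.80, 0.70],
                 [0.6, 0.02, 0.25, 0.04]])

dist = cosine_distance_matrix(top_heroes, pool)
# For each pool user keep the distance to the closest top hero, then sort ascending.
closest = dist.min(axis=0)
ranked = [pool_names[i] for i in np.argsort(closest)]
print(ranked)
```

Cosine distance compares only the direction of the feature vectors, so a pool user whose profile has the same shape as a top hero ranks first even if the magnitudes differ.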


### Summary of results 
### Here are some users you should consider interviewing next (these are algorithm-generated) 
(Based on the least cosine distance to the top heroes who had the most YouTube views) 


*     cdeotte ( https://www.kaggle.com/cdeotte )
*     triskelion ( https://www.kaggle.com/triskelion )
*     python10pm ( https://www.kaggle.com/python10pm )
*     roshansharma ( https://www.kaggle.com/roshansharma )
*     caesarlupum ( https://www.kaggle.com/caesarlupum )
*     mpwolke ( https://www.kaggle.com/mpwolke )
*     upadorprofzs ( https://www.kaggle.com/upadorprofzs )
*     tpthegreat ( https://www.kaggle.com/tpthegreat )


## Let's get started. 

In [None]:
df_episodes = pd.read_csv('/kaggle/input/chai-time-data-science/Episodes.csv')   #  Episodes 

### 1. Add the datafiles

* meta-kaggle/Users.csv
* meta-kaggle/UserAchievements.csv 

These files are publicly available datasets (Meta Kaggle) maintained by Kaggle. <br>
Search for them in the datasets. We will explore these files below.

In [None]:
df_users = pd.read_csv('/kaggle/input/meta-kaggle/Users.csv')   # User meta data 
df_ach = pd.read_csv('/kaggle/input/meta-kaggle/UserAchievements.csv') # User Achievements / levels  dataframe

In [None]:
df_users.head()

### 2. Keep only the heroes whose Kaggle usernames are populated
(We ignore the other heroes for now)

In [None]:
df_episodes  = pd.DataFrame(df_episodes[~df_episodes['heroes_kaggle_username'].isna()])

Temp cleaning: for now, remove the row containing 'dott1718 | philippsinger' (purely for ease of execution; it could be added back later) <br>
(Alternatively, just use **dott1718** as a representative)

In [None]:
df_episodes =  df_episodes[df_episodes['heroes_kaggle_username'] != 'dott1718 | philippsinger' ]

Our dataframe looks like this now ... 

In [None]:
df_episodes[['heroes']].head(10)

In [None]:
print("We have {} heroes who have a kaggle account".format(len(df_episodes)))

### 3. Get the date when the users joined Kaggle
(You could build a smaller subset of df_users to make the lookups faster; it is skipped here because we only do 2 lookups)

In [None]:
df_episodes['kaggle_join_date'] = df_episodes.heroes_kaggle_username.apply(lambda x : df_users[df_users['UserName'] == x].iloc[0,3])

Dataset looks like this now ...

In [None]:
df_episodes[['heroes' ,'kaggle_join_date']].head()

### 4. Get the Kaggle user id (a number such as 1571785)

In [None]:
df_episodes['kaggle_userid'] = df_episodes.heroes_kaggle_username.apply(lambda x : df_users[df_users['UserName'] == x].iloc[0,0])
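
The two per-row lookups above (join date and user id) could also be done with a single left merge. The sketch below uses tiny made-up frames; the column names `Id` and `RegisterDate` are assumptions about the layout of Users.csv, so check them against `df_users.head()` first.

```python
import pandas as pd

# Tiny stand-ins for df_episodes and df_users; all values are made up.
episodes = pd.DataFrame({
    "heroes": ["Hero A", "Hero B", "Hero C"],
    "heroes_kaggle_username": ["alpha", "beta", "missing_user"],
})
users = pd.DataFrame({
    "Id": [101, 102, 103],                      # assumed column name
    "UserName": ["alpha", "beta", "gamma"],
    "RegisterDate": ["01/15/2012", "06/30/2018", "03/02/2020"],  # assumed column name
})

lookup = users[["UserName", "Id", "RegisterDate"]].rename(
    columns={"Id": "kaggle_userid", "RegisterDate": "kaggle_join_date"}
)
merged = episodes.merge(
    lookup, how="left", left_on="heroes_kaggle_username", right_on="UserName"
).drop(columns="UserName")

# Usernames without a match end up as NaN instead of raising an IndexError.
unmatched = merged.loc[merged["kaggle_userid"].isna(), "heroes_kaggle_username"].tolist()
print(merged)
print(unmatched)
```

A merge scans the users table once rather than once per hero, and a username that is not in the users table shows up as a missing value rather than stopping the notebook.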

See what is populated so far .

In [None]:
df_episodes[['heroes' ,'kaggle_join_date' , 'kaggle_userid']].head(10)

### 5. Pull details from the Achievements dataset
The achievements dataset lists all Kagglers along with their points / rankings <br>
It looks like this

In [None]:
df_ach.head()

#### 5.a )  Create a sub-dataset of the achievement dataset for only the users  we really need ( in this case, the heroes)

In [None]:
trunc = df_ach[df_ach.UserId.isin(df_episodes['kaggle_userid'])]
trunc.head()

#### 5.b ) Add  additional columns ( highest rank on Competitions, scripts and discussions )

In [None]:
# iloc[0, 7] is assumed to pick the highest-ranking column of UserAchievements.csv
df_episodes['high_comp'] = df_episodes.kaggle_userid.apply(lambda x : trunc[(trunc['UserId'] == x) & (trunc['AchievementType']=='Competitions')].iloc[0,7])
df_episodes['high_scripts'] = df_episodes.kaggle_userid.apply(lambda x : trunc[(trunc['UserId'] == x) & (trunc['AchievementType']=='Scripts')].iloc[0,7])
df_episodes['high_disc'] = df_episodes.kaggle_userid.apply(lambda x : trunc[(trunc['UserId'] == x) & (trunc['AchievementType']=='Discussion')].iloc[0,7])
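
As an alternative to one lookup per category, the achievements rows can be pivoted so that each user gets one column per achievement type. This is a sketch on made-up rows; the column name `HighestRanking` is an assumption about UserAchievements.csv and should be checked against `df_ach.head()`.

```python
import pandas as pd

# Made-up rows shaped like UserAchievements.csv (column names are assumed).
ach = pd.DataFrame({
    "UserId": [101, 101, 101, 102, 102],
    "AchievementType": ["Competitions", "Scripts", "Discussion",
                        "Competitions", "Scripts"],
    "HighestRanking": [12.0, 340.0, 55.0, 900.0, 20.0],
})

ranks = (
    ach.pivot_table(index="UserId", columns="AchievementType",
                    values="HighestRanking", aggfunc="min")
       .reindex(columns=["Competitions", "Scripts", "Discussion"])
       .rename(columns={"Competitions": "high_comp",
                        "Scripts": "high_scripts",
                        "Discussion": "high_disc"})
)
print(ranks)
```

A user with no row (or no rank) in a category gets NaN in that column, and naming each column after its category in one place makes it harder to mix up scripts and discussions.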

See what is populated so far 

In [None]:
df_episodes[['heroes' ,'kaggle_join_date' , 'kaggle_userid' , 'high_comp', 'high_disc', 'high_scripts']].head()

### 6. Compute the time on Kaggle (since joining), in months

In [None]:
from datetime import date
from datetime import datetime

# Return the approximate number of months between the Kaggle join date and a fixed reference date
def get_months(str_d1):
    str_d2 = '07/12/2020'  # hardcoded reference date around last month (later, use the interview date instead)
    f_date = datetime.strptime(str_d1, '%m/%d/%Y')
    l_date = datetime.strptime(str_d2, '%m/%d/%Y')
    delta = l_date - f_date
    return delta.days / 30  # approximation: treats every month as 30 days

#get_months(df_episodes['kaggle_join_date'].iloc[1])

df_episodes['df_months_in_kaggle'] = df_episodes['kaggle_join_date'].apply(lambda x :get_months(x))
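
Dividing days by 30 is a fine approximation for this analysis. If whole calendar months are ever needed instead, one possible sketch is shown below; `months_between` is a hypothetical helper, and the default end date simply reuses the hardcoded reference date from above.

```python
from datetime import datetime

def months_between(start_str, end_str="07/12/2020", fmt="%m/%d/%Y"):
    """Whole calendar months from start to end (both given as strings in `fmt`)."""
    start = datetime.strptime(start_str, fmt)
    end = datetime.strptime(end_str, fmt)
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # Do not count the final month if its day has not been reached yet.
    if end.day < start.day:
        months -= 1
    return months

print(months_between("01/15/2012"))
print(months_between("07/12/2019"))
```

Since the feature is min-max scaled afterwards, the two definitions should give nearly the same picture; the difference only matters if the raw month counts are reported.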

This is what we have populated so far 

In [None]:
df_episodes[['heroes' ,'kaggle_join_date' , 'kaggle_userid' , 'high_comp', 'high_disc', 'high_scripts','df_months_in_kaggle']].head()

### 7. Create the truncated df with only the columns you need

In [None]:
#Building the final df

# .copy() gives an independent frame, so later column assignments do not trigger SettingWithCopyWarning
df_got = df_episodes[['heroes' ,'kaggle_join_date' , 'kaggle_userid' , 'high_comp', 'high_disc', 'high_scripts','df_months_in_kaggle',
                      'youtube_views','youtube_watch_hours']].copy()

This is what the final dataset looks like 

In [None]:
df_got.head()

### 8. Scale the data using Min-Max Scaler

In [None]:
#make a copy as back up 
df_got2 = df_got.copy()

In [None]:
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()

df_got[['df_months_in_kaggle', 'youtube_views','high_comp','high_scripts','high_disc']] = \
scaler.fit_transform(df_got[['df_months_in_kaggle', 'youtube_views','high_comp','high_scripts','high_disc']])

In [None]:
df_got.isna().sum()

Note: There are some NaN values because some heroes are unranked in Competitions, Scripts or Discussions <br>
If a value is NaN, assign it the worst (numerically highest) rank in that column. 

In [None]:

# If the value is NaN, replace it with the column maximum (the worst rank).
# fillna avoids the chained assignment df[col][mask] = ..., which may silently fail to update df_got.
for col in ['high_comp', 'high_disc', 'high_scripts']:
    df_got[col] = df_got[col].fillna(df_got[col].max())
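
To make the scaling and the NaN handling concrete: min-max scaling maps each value x to (x - min) / (max - min), and to my understanding scikit-learn's MinMaxScaler ignores NaN when fitting and leaves it as NaN in the output. The sketch below reproduces both steps with plain NumPy on made-up ranks; `min_max_fill` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def min_max_fill(values):
    """Min-max scale to [0, 1] ignoring NaN, then set NaN entries to the scaled maximum (1.0)."""
    x = np.asarray(values, dtype=float)
    lo, hi = np.nanmin(x), np.nanmax(x)
    span = hi - lo
    # Guard against a constant column, where max - min would be zero.
    scaled = (x - lo) / span if span > 0 else np.zeros_like(x)
    # Unranked (NaN) entries get the worst scaled rank.
    scaled[np.isnan(x)] = 1.0
    return scaled

# Made-up highest ranks; NaN stands for an unranked hero.
scaled = min_max_fill([10, np.nan, 110, 60])
print(scaled)
```

Here rank 10 maps to 0, rank 60 to 0.5, rank 110 to 1, and the unranked entry is placed alongside the worst rank at 1.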

### The scaled dataframe looks like this 

In [None]:
df_got.head()

### Highest Competition Ranking vs. Months on Kaggle vs. YouTube watch hours 

You can try other combinations of these (scripts / discussions / YouTube / other podcasts) for further analysis (not done here)

In [None]:
import seaborn as sns
sns.set(style="white")

sns.relplot(data=df_got , x="df_months_in_kaggle", y="high_comp", #hue="strength" ,
            #size="youtube_watch_hours",
            size="youtube_watch_hours",
            sizes=(0, 1000), alpha=.4, palette="muted",
            height=6)

### Observations 
Note: bigger circles mean more YouTube viewership
* Heroes who have been on Kaggle for a long time (scaled here to 0 - 1: 0 ~ 2020, 0.5 ~ 2014, 1 ~ 2010) and who have a numerically lower (i.e. better) ranking are more popular on the CTDS show (here, on YouTube)
* Beware of causality: ranking and time on Kaggle are associated with viewership, but that does not make them the cause. It mostly looks like a residual effect.



### outlier removed ( darker colors represent more youtube views )

In [None]:
# Remove Jeremy's show: it is an outlier record
df_got = df_got[~(df_got.youtube_watch_hours >700)]

from matplotlib import pyplot as plt 
x= df_got.df_months_in_kaggle
y= df_got.high_comp
z = df_got.youtube_watch_hours

cmap = sns.cubehelix_palette(rot=-1,as_cmap=True)
#cmap = sns.cubehelix_palette(rot=0.5,as_cmap=True)
f, ax = plt.subplots()
points = ax.scatter( x , y, c=z, s=80, cmap=cmap)
f.colorbar(points)
plt.xlabel("months_in_kaggle")
plt.ylabel("comp_highestrank")
plt.show();

### Alternate analysis
Get the best ranking across the three categories: Competitions, Scripts (Kernels) and Discussions

In [None]:
df_got['best_3'] = df_got.apply(lambda x:min(x['high_comp'],x['high_scripts'],x['high_disc']),axis = 1)
df_got.head()

In [None]:
df_got[(df_got['high_disc'] < 0.06)].sort_values(by='high_disc',ascending = False).head(1)

In [None]:
df_got[(df_got['df_months_in_kaggle'] > 0.5)].sort_values(by='df_months_in_kaggle',ascending = True).head()

In [None]:
df_ach[(df_ach['UserId'] ==113389) & (df_ach['AchievementType'] =='Discussion') ] #Get the user id from the above output ( todo:make it automated )

In [None]:
import seaborn as sns
sns.set(style="white")

sns.relplot(data=df_got , x="df_months_in_kaggle", y="high_scripts", #hue="strength" ,
            #size="youtube_watch_hours",
            size="youtube_watch_hours",
            sizes=(0, 1000), alpha=.4, palette="muted",
            height=6)
plt.axvspan(.3, .9, color='blue', alpha=0.05)
plt.axhspan(0, .08, color='red', alpha=0.05)
plt.show();

### First-level conclusions on how to choose the next Kaggler to interview
The intersection of the red and blue bands (in the graph above; currently determined visually) should be a good sweet spot for filtering potential Kagglers to choose for interviews.
This translates to 2 filters: age on Kaggle and best ranking (in competitions / scripts / discussions)
(The code that determines these numbers is hidden in the notebook; unhide it to see)
### Duration in kaggle - 
Potential high-youtube-view heroes could be those kagglers who have joined kaggle before <font color= 'red'>  2016 January </font>!
### Ranking - 
* Their best Competition ranking is <font color= 'red'>288 </font> or better (or thereabouts)
* Their best Script ranking is <font color= 'red'> 36 </font> or better (or thereabouts)
* Their best Discussion ranking is <font color= 'red'>18 </font> or better (or thereabouts)
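
The two filters above could be applied to a pool of Kagglers roughly as follows. The pool is made up, and combining the three rank thresholds with OR (any one category is enough) is my assumption; use `&` instead of `|` if every threshold should hold.

```python
import pandas as pd

# Made-up candidate pool; column names mirror the ones used in this notebook.
pool = pd.DataFrame({
    "UserName": ["u1", "u2", "u3", "u4"],
    "RegisterDate": ["03/10/2013", "11/02/2015", "05/20/2018", "08/01/2014"],
    "high_comp": [150, 2000, 40, 5000],
    "high_scripts": [500, 30, 10, 900],
    "high_disc": [300, 400, 5, 700],
})
pool["RegisterDate"] = pd.to_datetime(pool["RegisterDate"], format="%m/%d/%Y")

# Filter 1: joined Kaggle before January 2016.
joined_early = pool["RegisterDate"] < pd.Timestamp("2016-01-01")
# Filter 2: at least one category at or better than its threshold (assumed OR).
well_ranked = (
    (pool["high_comp"] <= 288)
    | (pool["high_scripts"] <= 36)
    | (pool["high_disc"] <= 18)
)
candidates = pool.loc[joined_early & well_ranked, "UserName"].tolist()
print(candidates)
```

In this toy pool, `u3` is well ranked but joined too recently, and `u4` joined early but misses every rank threshold, so neither passes.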

Some good candidates for the next round of interviews  ( [based on least-cosine distance between top heroes and pool of kaggle users - link here ](https://www.kaggle.com/tomtillo/cosine-distance-between-top-heroes) )

*     cdeotte ( https://www.kaggle.com/cdeotte )
*     triskelion ( https://www.kaggle.com/triskelion )
*     python10pm ( https://www.kaggle.com/python10pm )
*     roshansharma ( https://www.kaggle.com/roshansharma )
*     caesarlupum ( https://www.kaggle.com/caesarlupum )
*     mpwolke ( https://www.kaggle.com/mpwolke )
*     upadorprofzs ( https://www.kaggle.com/upadorprofzs )
*     tpthegreat ( https://www.kaggle.com/tpthegreat )


### Further improvements
**This is a POC, this should be extended .**
1. Add more attributes (beyond the 3 Kaggle ranks and age on Kaggle) 
2. Give weights to the ranks (e.g. rank_competition > rank_kernels > rank_discussion)
3. Try expanding the list from top-3 matches to top-10 matches 
4. Change the filter criteria 
5. Hypothesis: age on Kaggle may have no effect of its own (bet: it is highly correlated with the competition rank)
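
One way to check the last hypothesis would be a rank correlation between months on Kaggle and best competition rank. The sketch below uses made-up numbers purely to show the mechanics; the `spearman` helper is hypothetical and does not handle ties.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for data without ties: Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Made-up values: longer tenure paired with a numerically lower (better) rank.
months = [120, 96, 60, 30, 12]
best_comp_rank = [5, 40, 300, 2500, 9000]

rho = spearman(months, best_comp_rank)
print(round(rho, 3))
```

A value close to -1 or +1 on the real data would support the bet that the two features carry largely the same information, in which case one of them could be dropped.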