Correct URLs and add information to the original dataset.

___

In [1]:
import pandas as pd
import numpy as np
from utils import get_id, get_urls  # project helpers used in the cells below

Read .csv files:

In [2]:
df_original = pd.read_csv("data/youtube-a-lecole-hutin.csv", delimiter=";", encoding='utf-8')
df_url_corrected = pd.read_csv("data/url_corrected.csv", delimiter=",", encoding='utf-8')
df_ydt_infos = pd.read_csv("data/all_channels_depth0_nosub_340_358.csv", delimiter=";", encoding='utf-8')

In [3]:
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354 entries, 0 to 353
Data columns (total 7 columns):
Ordre                                     354 non-null float64
Discipline                                354 non-null object
Nom de la chaine YouTube                  354 non-null object
URL de la chaine YouTube                  354 non-null object
Mots-clés                                 340 non-null object
Description (par le/la vidéaste)          300 non-null object
Description (par l'auteure du rapport)    54 non-null object
dtypes: float64(1), object(6)
memory usage: 19.5+ KB


Rename columns:

In [4]:
df_original.columns = ['idx', 'category', 'name', 'url', 'keywords', 'desc', 'desc_rapport']
df_original["idx"] = df_original["idx"].astype(int)  # astype already returns a new Series

In [5]:
df_original.head(2)

Unnamed: 0,idx,category,name,url,keywords,desc,desc_rapport
0,196,Technologie et informatique,Léo - Techmaker,https://www.youtube.com/channel/UCRhyS_ylPQ5GW...,Technologie-ingénierie-électronique-Do it your...,« Salut c'est Léo de la chaine TechMaker ! Tes...,
1,212,Arts et histoire de l’art,N’art,https://www.youtube.com/channel/UCQq9fMRQhXOyO...,Transversal-critiques d'expo,"« Apprendre à reconnaître le style de Klimt, c...",


Add the corrected URLs to the original dataset (left merge on url, so rows without a correction get NaN):

In [6]:
df_corrected = df_original.merge(df_url_corrected[['url', 'url_corrected']], on="url", how='left').copy()
df_corrected.head(1)

Unnamed: 0,idx,category,name,url,keywords,desc,desc_rapport,url_corrected
0,196,Technologie et informatique,Léo - Techmaker,https://www.youtube.com/channel/UCRhyS_ylPQ5GW...,Technologie-ingénierie-électronique-Do it your...,« Salut c'est Léo de la chaine TechMaker ! Tes...,,
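
A left merge can silently duplicate rows when the right-hand table contains the same key more than once. The sketch below shows one way to guard against that with the validate argument of merge. The toy frames and example.org URLs are placeholders, not rows from the real files.

```python
import pandas as pd

# Toy data (hypothetical); the real notebook uses df_original and df_url_corrected.
original = pd.DataFrame({
    "idx": [1, 2, 3],
    "url": ["https://example.org/a", "https://example.org/b", "https://example.org/a"],
})
corrections = pd.DataFrame({
    "url": ["https://example.org/b"],
    "url_corrected": ["https://example.org/b-fixed"],
})

# validate="many_to_one" raises pandas.errors.MergeError if a URL appears
# more than once in `corrections`, which would otherwise multiply rows.
merged = original.merge(corrections, on="url", how="left", validate="many_to_one")

# A left merge keeps every row of `original`; unmatched rows get NaN.
print(merged)
```

With this guard in place, the merged frame has as many rows as the left-hand frame.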


Merge URLs: use the corrected URL when one exists, otherwise keep the original:

In [7]:
df_corrected['url_final'] = np.where(df_corrected['url_corrected'].isnull(), df_corrected['url'], df_corrected['url_corrected'])

# Sanity check: idx 196 has no correction (keeps url), idx 208 has one (takes url_corrected)
df_corrected[df_corrected["idx"].isin([196, 208])][["idx", "url", "url_corrected", "url_final"]]


Unnamed: 0,idx,url,url_corrected,url_final
0,196,https://www.youtube.com/channel/UCRhyS_ylPQ5GW...,,https://www.youtube.com/channel/UCRhyS_ylPQ5GW...
6,208,https://www.youtube.com/user/Rmngrandpalais/pl...,https://www.youtube.com/channel/UCyAiVPzrW_o5P...,https://www.youtube.com/channel/UCyAiVPzrW_o5P...
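
The same "take the correction if present, else the original" logic can also be written with fillna. The following is a minimal sketch on toy data (the example.org URLs are placeholders) that checks both forms agree.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking the url / url_corrected columns.
toy = pd.DataFrame({
    "url": ["https://example.org/a", "https://example.org/b"],
    "url_corrected": [np.nan, "https://example.org/b-fixed"],
})

# Form used in the notebook: pick `url` wherever the correction is missing.
via_where = np.where(toy["url_corrected"].isnull(), toy["url"], toy["url_corrected"])

# Equivalent form: fill the missing corrections with the original URL.
via_fillna = toy["url_corrected"].fillna(toy["url"])

print(list(via_where) == list(via_fillna))
```

fillna returns a Series that keeps the frame's index, whereas np.where returns a plain NumPy array. Either can be assigned to a new column.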


Extract each channel's unique YouTube identifier from its URL:

In [8]:
df_corrected["ydt_id"] = df_corrected["url_final"].apply(get_id)

In [9]:
df_corrected[["idx", "url_final", "ydt_id"]].head(2)

Unnamed: 0,idx,url_final,ydt_id
0,196,https://www.youtube.com/channel/UCRhyS_ylPQ5GW...,UCRhyS_ylPQ5GWBl1lK92ftA
1,212,https://www.youtube.com/channel/UCQq9fMRQhXOyO...,UCQq9fMRQhXOyOZeageaj6ag
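
The implementation of get_id lives in utils and is not shown here. As an illustration only, a helper of this kind could be sketched with a regular expression. The name extract_channel_id, the pattern, and the placeholder URLs below are assumptions, not the project's actual code. URLs of the /user/ form do not contain the identifier, so this sketch returns None for them.

```python
import re
from typing import Optional

# Assumed pattern: the identifier is the path segment right after /channel/.
_CHANNEL_RE = re.compile(r"youtube\.com/channel/([A-Za-z0-9_-]+)")

def extract_channel_id(url: str) -> Optional[str]:
    """Hypothetical stand-in for get_id: return the channel id, or None."""
    match = _CHANNEL_RE.search(url)
    return match.group(1) if match else None

# Placeholder URLs, not taken from the dataset.
print(extract_channel_id("https://www.youtube.com/channel/UC_placeholder-123/videos"))
print(extract_channel_id("https://www.youtube.com/user/someone"))
```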


In [10]:
urls = get_urls("ydt_id", False, False, df_corrected)

In [11]:
df_ydt_infos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 340 entries, 0 to 339
Data columns (total 11 columns):
Id                 340 non-null object
Label              340 non-null object
timeset            0 non-null float64
isseed             340 non-null object
seedrank           340 non-null int64
subscribercount    340 non-null int64
videocount         340 non-null int64
viewcount(100s)    340 non-null int64
country            340 non-null object
publishedat        340 non-null object
daysactive         340 non-null int64
dtypes: float64(1), int64(5), object(5)
memory usage: 29.3+ KB


In [12]:
df_ydt_infos = df_ydt_infos.rename(columns={"Id": "ydt_id"})

In [13]:
print(len(set(df_original['url'])), ": number of unique urls")
print(len(set(df_ydt_infos['ydt_id'])), ": number of unique YouTube identifiers")
print(len(df_ydt_infos), ": same as before (no duplicates)")
# Difference between the two counts above, not a list of specific urls
print(len(set(df_original['url']))-len(set(df_ydt_infos['ydt_id'])), ": number of urls unused by YDT")
# Number of unique identifiers after URL correction
print(len(set(df_corrected['ydt_id'])))
# Number of rows kept by the (inner) merge with the YDT information
print(len(df_corrected.merge(df_ydt_infos, on="ydt_id")))

348 : number of unique urls
340 : number of unique YouTube identifiers
340 : same as before (no duplicates)
8 : number of urls unused by YDT
346
350
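
The inner merge keeps 350 rows, while the original dataset has 354, so a few rows found no match in the YDT file. One way to list such rows is a left merge with indicator=True. The sketch below uses toy frames with made-up identifiers.

```python
import pandas as pd

# Toy frames (hypothetical values) standing in for df_corrected and df_ydt_infos.
channels = pd.DataFrame({"idx": [1, 2, 3], "ydt_id": ["UC_a", "UC_b", "UC_c"]})
infos = pd.DataFrame({"ydt_id": ["UC_a", "UC_c"], "videocount": [10, 20]})

# indicator=True adds a `_merge` column: "both" or "left_only" for a left merge.
checked = channels.merge(infos, on="ydt_id", how="left", indicator=True)

# Rows of `channels` that have no counterpart in `infos`.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["idx", "ydt_id"]])
```

Applied to the real frames, the unmatched rows would show which channels are dropped by the inner merge.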


In [14]:
df_corrected.merge(df_ydt_infos, on="ydt_id").to_csv("data/generated/data_with_info.csv", sep=';', encoding='utf-8')
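
Because to_csv writes the DataFrame index as an unnamed first column by default, reading the file back naively produces an "Unnamed: 0" column. The sketch below shows this round trip on a toy frame using an in-memory buffer instead of the real file.

```python
import io
import pandas as pd

# Toy frame (hypothetical values); the buffer stands in for the CSV file.
toy = pd.DataFrame({"idx": [196, 212], "name": ["a", "b"]})
buffer = io.StringIO()
toy.to_csv(buffer, sep=";")  # the index is written as an unnamed first column

# Naive read: the saved index comes back as a regular column.
buffer.seek(0)
naive = pd.read_csv(buffer, delimiter=";")

# Reading with index_col=0 restores the first column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, delimiter=";", index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing index=False to to_csv is the other common option when the index carries no information.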