## Task 2a. Imports and data loading

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

## Task 2b. Inspecting and understanding the data

In [3]:
# Display and examine the first ten rows of the dataframe
data.head(10)

,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


*Question 1: When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?*

Each row represents a single TikTok video and data about it. One column tells us whether the video is classified as a claim. There is also a transcript of the video audio, which may contain keywords that help indicate a video's classification.

In [4]:
# Get summary info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


The dataframe has 19,382 rows, but seven columns have only 19,084 non-null values.

In [5]:
19382 - 19084

298

Each of those columns is missing 298 values.

*Question 2: When reviewing the data.info() output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?*

We have a mix of numeric and non-numeric data. Five columns have data in every row, but the other seven columns are each missing 298 values. Since every incomplete column is missing the same number of values, the same 298 rows may be missing all of those fields. This will need to be confirmed. The text-based columns have the "object" dtype and may need to be cast to strings in order to parse them properly. Our dataframe takes up at least 1.8 MB of memory (the "+" in the output means the object columns are not fully counted).
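One way to confirm that the missing values sit in the same rows is to compare the number of rows with any null against the number of rows where every incomplete column is null. This is a minimal sketch on a small toy frame; the `toy` dataframe and its values are hypothetical stand-ins for `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `data` (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "video_id": [101, 102, 103, 104],
    "claim_status": ["claim", "opinion", np.nan, np.nan],
    "video_view_count": [500.0, 20.0, np.nan, np.nan],
    "video_like_count": [50.0, 2.0, np.nan, np.nan],
})

# Nulls per column: the complement of the Non-Null Count column in info().
nulls_per_column = toy.isna().sum()

# Rows with at least one null.
rows_with_any_null = int(toy.isna().any(axis=1).sum())

# Rows where every incomplete column is null at the same time.
incomplete_cols = nulls_per_column[nulls_per_column > 0].index
rows_with_all_null = int(toy[incomplete_cols].isna().all(axis=1).sum())

# If the two counts match, the gaps are confined to the same rows.
same_rows = rows_with_any_null == rows_with_all_null
print(nulls_per_column)
print(rows_with_any_null, rows_with_all_null, same_rows)
```

Running the same three steps on `data` should show whether both counts equal 298.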

In [6]:
# Get summary statistics
data.describe()

,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


*Question 3: When reviewing the data.describe() output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?*

First, the summary statistics for # and video_id should be ignored, since those columns are row and video identifiers rather than quantitative measurements. We have a video duration for every video in the frame, ranging from 5 to 60 seconds, but the remaining numeric columns are missing 298 entries. The engagement columns, such as view count and download count, have some very high values that likely correspond to viral videos. In each of these columns the standard deviation is larger than the mean and the mean is far above the median, which indicates a strong right skew.
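The skew can be checked with simple arithmetic on the describe() output. The sketch below copies the mean, standard deviation, and median (50%) of each engagement column from that table and computes how many times larger the mean is than the median; the dictionary layout is only an illustration.

```python
# (mean, std, median) copied from the describe() output above.
summary = {
    "video_view_count": (254708.558688, 322893.280814, 9954.5),
    "video_like_count": (84304.63603, 133420.546814, 3403.5),
    "video_share_count": (16735.248323, 32036.17435, 717.0),
    "video_download_count": (1049.429627, 2004.299894, 46.0),
    "video_comment_count": (349.312146, 799.638865, 9.0),
}

# A mean well above the median is a common sign of right skew.
mean_to_median = {col: mean / median for col, (mean, std, median) in summary.items()}
std_exceeds_mean = {col: std > mean for col, (mean, std, median) in summary.items()}

for col in summary:
    print(f"{col}: mean/median = {mean_to_median[col]:.1f}, std > mean: {std_exceeds_mean[col]}")
```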

## Task 2c. Understand the data - Investigate the variables
In this phase, you will begin to investigate the variables more closely to better understand them.

You know from the project proposal that the ultimate objective is to use machine learning to classify videos as either claims or opinions. A good first step towards understanding the data might therefore be examining the claim_status variable. Begin by determining how many videos there are for each different claim status.

In [7]:
# What are the different values for claim status and how many of each are in the data?
data['claim_status'].value_counts()

claim_status
claim      9608
opinion    9476
Name: count, dtype: int64

In [12]:
(9608 + 9476)

19084

In [11]:
9608 / (9608 + 9476)

0.5034583944665688

*Question: What do you notice about the values shown?*

There are slightly more videos classified as containing claims than as containing opinions: roughly 50.3% are claim videos and 49.7% are opinion videos.
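The percentages can also come straight from pandas instead of manual division. A minimal sketch on a hypothetical toy series (`toy_status` is not the real column), using the `normalize` and `dropna` parameters of `value_counts`:

```python
import pandas as pd

# Toy labels (hypothetical), including one missing value.
toy_status = pd.Series(["claim", "claim", "opinion", None], name="claim_status")

# normalize=True returns proportions; missing values are excluded by default.
shares = toy_status.value_counts(normalize=True)

# dropna=False keeps the missing values as their own category.
shares_with_missing = toy_status.value_counts(normalize=True, dropna=False)

print(shares)
print(shares_with_missing)
```

On the notebook's dataframe the analogous call would be `data['claim_status'].value_counts(normalize=True)`.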

Our data contains 19,084 videos with a claim status, but 298 videos have no value in that field. We should consider filling the missing values with a placeholder, manually evaluating the claim status of those videos, or removing those rows.
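Two of those options are easy to sketch in pandas. The toy frame and the `"unknown"` placeholder below are assumptions for illustration only; which option is appropriate depends on how the labels will be used later.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data` (hypothetical values).
toy = pd.DataFrame({
    "claim_status": ["claim", np.nan, "opinion"],
    "video_view_count": [500.0, np.nan, 20.0],
})

# Option 1: keep the rows but mark the missing label with a placeholder.
flagged = toy.assign(claim_status=toy["claim_status"].fillna("unknown"))

# Option 2: drop rows that have no claim status.
labeled = toy.dropna(subset=["claim_status"])

print(flagged["claim_status"].tolist())
print(len(toy), len(labeled))
```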

In [13]:
claim_mask = data['claim_status'] == 'claim'
opinion_mask = data['claim_status'] == 'opinion'

In [14]:
# What is the average view count of videos with "claim" status?
data[claim_mask]['video_view_count'].describe()

count      9608.000000
mean     501029.452748
std      291349.239825
min        1049.000000
25%      247003.750000
50%      501555.000000
75%      753088.000000
max      999817.000000
Name: video_view_count, dtype: float64

In [15]:
# What is the average view count of videos with "opinion" status?
data[opinion_mask]['video_view_count'].describe()

count    9476.000000
mean     4956.432250
std      2885.907219
min        20.000000
25%      2467.000000
50%      4953.000000
75%      7447.250000
max      9998.000000
Name: video_view_count, dtype: float64

In [23]:
501029 / (501029 + 4956)

0.990205243238436

*Question: What do you notice about the mean and median within each claim category?*

According to our data, videos containing claims get far more views than videos containing opinions. The average claim video gets about 501,000 views, while the average opinion video gets just under 5,000. That is a big difference: because the two groups are nearly the same size, the ratio of the means suggests that about 99% of all views go to claim videos. Within each category, the median is close to the mean, which suggests that view counts are roughly symmetric within each group.
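The calculation above divides group means, which only approximates the share of total views. A slightly more careful version, sketched here, recovers each group's total as count times mean using the numbers from the two describe() outputs:

```python
# Count and mean view count per claim status, copied from the describe() outputs above.
claim_count, claim_mean = 9608, 501029.452748
opinion_count, opinion_mean = 9476, 4956.432250

# count * mean recovers each group's total views.
claim_total = claim_count * claim_mean
opinion_total = opinion_count * opinion_mean

# Share of all views that went to claim videos.
claim_share = claim_total / (claim_total + opinion_total)
print(f"{claim_share:.4f}")
```

The result is still about 99%. On the dataframe itself, `data.groupby('claim_status')['video_view_count'].sum()` would give the totals directly.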

In [16]:
# Get counts for each group combination of claim status and author ban status
data.groupby(['claim_status', 'author_ban_status']).count()

claim_status,author_ban_status,#,video_id,video_duration_sec,video_transcription_text,verified_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
claim,active,6566,6566,6566,6566,6566,6566,6566,6566,6566,6566
claim,banned,1439,1439,1439,1439,1439,1439,1439,1439,1439,1439
claim,under review,1603,1603,1603,1603,1603,1603,1603,1603,1603,1603
opinion,active,8817,8817,8817,8817,8817,8817,8817,8817,8817,8817
opinion,banned,196,196,196,196,196,196,196,196,196,196
opinion,under review,463,463,463,463,463,463,463,463,463,463


*Question: What do you notice about the number of claims videos with banned authors? Why might this relationship occur?*

Far more claim videos than opinion videos come from banned authors (1,439 versus 196). Making false claims and spreading misinformation is against TikTok's terms of service, so videos containing claims are more likely to violate the platform's content guidelines. There may also be a correlation between the higher engagement of claim videos and the number of reports they receive. If more people view the content, more of them may report a video that goes against the site rules.
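Raw counts are easier to compare as proportions. The sketch below rebuilds the count table from the groupby output above and normalizes it both ways; the variable names are illustrative.

```python
import pandas as pd

# Video counts copied from the groupby output above.
counts = pd.DataFrame(
    {"active": [6566, 8817], "banned": [1439, 196], "under review": [1603, 463]},
    index=["claim", "opinion"],
)

# Share of each ban status within a claim status (each row sums to 1).
within_claim_status = counts.div(counts.sum(axis=1), axis=0)

# Share of each claim status within a ban status (each column sums to 1).
within_ban_status = counts.div(counts.sum(axis=0), axis=1)

print(within_claim_status.round(3))
print(within_ban_status.round(3))
```

About 15% of claim videos come from banned authors, compared with about 2% of opinion videos. On the full dataframe, `pd.crosstab(data['claim_status'], data['author_ban_status'], normalize='index')` should produce the row-normalized table directly.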

In [18]:
# What's the median video share count of each author ban status?
data.groupby(['author_ban_status'])['video_share_count'].median()

author_ban_status
active            437.0
banned          14468.0
under review     9444.0
Name: video_share_count, dtype: float64

*Question: What do you notice about the share count of banned authors, compared to that of active authors? Explore this in more depth.*

Videos created by banned authors have a much larger median share count (14,468) than videos by active authors (437), and videos by authors under review fall in between (9,444). This suggests that the most shared content on the platform tends to come from authors who have violated, or are suspected of violating, the content policy. This is a concerning trend, as it would mean a large share of TikTok impressions involve content that does not reflect the platform's rules. It could also imply that TikTok's recommendation algorithm spreads rule-violating content more than compliant content. Alternatively, we could hypothesize that rule-breaking content is simply more popular and attracts more viewers than content that breaks no rules.
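To explore this in more depth, one option is to split the medians by claim status as well, since the earlier counts show that banned authors are concentrated among claim videos. The sketch below uses invented toy values purely to show the technique; its numbers are not results from the dataset.

```python
import pandas as pd

# Toy stand-in for `data` (hypothetical values chosen for illustration).
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "author_ban_status": ["active", "banned", "banned", "active", "active", "banned"],
    "video_share_count": [9000.0, 12000.0, 14000.0, 300.0, 500.0, 400.0],
})

# Median shares by ban status alone mixes claim and opinion videos together.
by_ban = toy.groupby("author_ban_status")["video_share_count"].median()

# Splitting by claim status first shows whether the gap persists within each group.
by_both = toy.groupby(["claim_status", "author_ban_status"])["video_share_count"].median()

print(by_ban)
print(by_both)
```

On the notebook's dataframe the analogous call would be `data.groupby(['claim_status', 'author_ban_status'])['video_share_count'].median()`.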

In [20]:
data.groupby(['author_ban_status']).agg({
    'video_view_count': ['count', 'mean', 'median'], 
    'video_like_count': ['count', 'mean', 'median'], 
    'video_share_count': ['count', 'mean', 'median']})

,video_view_count,video_view_count,video_view_count,video_like_count,video_like_count,video_like_count,video_share_count,video_share_count,video_share_count
author_ban_status,count,mean,median,count,mean,median,count,mean,median
active,15383,215927.039524,8616.0,15383,71036.533836,2222.0,15383,14111.466164,437.0
banned,1635,445845.439144,448201.0,1635,153017.236697,105573.0,1635,29998.942508,14468.0
under review,2066,392204.836399,365245.5,2066,128718.050339,71204.5,2066,25774.696999,9444.0


*Question: What do you notice about the number of views, likes, and shares for banned authors compared to active authors?*

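This view of the data confirms that videos by banned authors have the highest mean and median views, likes, and shares, followed by videos by authors under review, with active authors lowest. For active authors the mean is far above the median in every metric, so a small number of very popular videos pulls their average up.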

In [21]:
# Create a likes_per_view column
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']

# Create a comments_per_view column
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']

# Create a shares_per_view column
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']

In [22]:
data.groupby(['claim_status', 'author_ban_status']).agg({
    'likes_per_view': ['count', 'mean', 'median'], 
    'comments_per_view': ['count', 'mean', 'median'], 
    'shares_per_view': ['count', 'mean', 'median']})

,,likes_per_view,likes_per_view,likes_per_view,comments_per_view,comments_per_view,comments_per_view,shares_per_view,shares_per_view,shares_per_view
claim_status,author_ban_status,count,mean,median,count,mean,median,count,mean,median
claim,active,6566,0.329542,0.326538,6566,0.001393,0.000776,6566,0.065456,0.049279
claim,banned,1439,0.345071,0.358909,1439,0.001377,0.000746,1439,0.067893,0.051606
claim,under review,1603,0.327997,0.320867,1603,0.001367,0.000789,1603,0.065733,0.049967
opinion,active,8817,0.219744,0.21833,8817,0.000517,0.000252,8817,0.043729,0.032405
opinion,banned,196,0.206868,0.198483,196,0.000434,0.000193,196,0.040531,0.030728
opinion,under review,463,0.226394,0.228051,463,0.000536,0.000293,463,0.044472,0.035027


*Question: How does the data for claim videos and opinion videos compare or differ? Consider views, comments, likes, and shares.*

This data shows that claim videos get much more engagement than opinion videos, and also have higher ratios of engagement to views. Likes per view, comments per view, and shares per view are all higher for claim videos than for opinion videos.

Within claim videos, banned authors receive slightly more likes and shares per view, while comments per view are about the same across all three ban statuses. However, the same is not true of opinion videos. Within that category, banned authors get the lowest engagement ratios, followed by active authors and then authors under review.
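To put a number on "higher", the mean likes per view of claim videos can be divided by that of opinion videos for each ban status. The values below are copied from the table above; the dictionary layout is only an illustration, and the same arithmetic applies to the comment and share ratios.

```python
# Mean likes_per_view copied from the table above: (claim, opinion) per ban status.
likes_per_view_mean = {
    "active": (0.329542, 0.219744),
    "banned": (0.345071, 0.206868),
    "under review": (0.327997, 0.226394),
}

# How many times higher the claim ratio is than the opinion ratio.
claim_to_opinion = {
    status: claim / opinion
    for status, (claim, opinion) in likes_per_view_mean.items()
}

for status, ratio in claim_to_opinion.items():
    print(f"{status}: {ratio:.2f}x")
```

Claim videos earn roughly 1.4 to 1.7 times as many likes per view as opinion videos, with the largest gap among banned authors.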

### **Given your efforts, what can you summarize for Rosie Mae Bradshaw and the TikTok data team?**

*Note for Learners: Your answer should address TikTok's request for a summary that covers the following points:*

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?

- Roughly 50.3% of the labeled videos in our dataset contain claims, and around 49.7% contain opinions.
- Videos with claims are more likely to be by authors who are banned or under review. Videos containing claims also get more engagement in the form of likes, comments, and shares per view (the engagement ratio) than videos with opinions. The same relationship holds for views to a massive degree: claim videos received about 99% of views in this dataset.
- The videos with the highest views and engagement ratios contain claims and are by banned authors. There are a few ways this could imply problems for the platform:
  - Videos containing content against site rules receive the most views and engagement
  - TikTok's recommendation algorithm may be pushing videos containing content against site rules more than non-violating videos
  - Videos with rule-breaking content may be intrinsically more popular and attract more attention than non-violating videos