## Scenario

You are a data professional at TikTok. The current project is reaching its midpoint; a project proposal, Python coding work, and exploratory data analysis have all been completed.

The team has reviewed the results of the exploratory data analysis and the previous executive summary the team prepared. You received an email from Orion Rainier, Data Scientist at TikTok, with your next assignment: determine and conduct the necessary hypothesis tests and statistical analysis for the TikTok classification project.

## Data exploration and hypothesis testing

1. **The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze hypothesis tests.

2. **The goal** is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.
<br/>


## Plan Stage

1. Research question for this data project:
   * Do videos from verified accounts and videos from unverified accounts have different average view counts?
   * Is there a relationship between the account being verified and the associated videos' view counts?

In [1]:
# import relevant libraries
# data manipulation
import pandas as pd
import numpy as np

# data visualization
import seaborn as sns
import matplotlib.pyplot as plt

# statistical modelling
from scipy import stats

In [2]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")
data.head()

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0


## Analyze and Construct Stage

In [3]:
data.shape

(19382, 12)

In [4]:
data.describe()

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


In [5]:
# checking for any missing values
missing_values = round((data.isna().sum() / data.shape[0]) * 100, 2)
print(f'Percentage of missing_values:\n{missing_values}')

Percentage of missing_values:
#                           0.00
claim_status                1.54
video_id                    0.00
video_duration_sec          0.00
video_transcription_text    1.54
verified_status             0.00
author_ban_status           0.00
video_view_count            1.54
video_like_count            1.54
video_share_count           1.54
video_download_count        1.54
video_comment_count         1.54
dtype: float64


* Because only about 1.5% of the values in each affected column are missing, dropping those rows should have little impact on the analysis.
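
The 1.54% figure can be cross-checked by hand against the outputs above. The short sketch below uses only the row count from `data.shape` and the non-null `count` of `video_view_count` from `data.describe()`; the variable names are illustrative.

```python
total_rows = 19382      # number of rows reported by data.shape
non_null_views = 19084  # 'count' of video_view_count in data.describe()

# Rows with a missing view count, and their share of the dataset
missing_rows = total_rows - non_null_views
missing_pct = round(missing_rows / total_rows * 100, 2)

print(missing_rows, missing_pct)
```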

In [6]:
# Drop rows that contain missing values
data = data.dropna(axis=0)

In [7]:
data.columns.to_list()

['#',
 'claim_status',
 'video_id',
 'video_duration_sec',
 'video_transcription_text',
 'verified_status',
 'author_ban_status',
 'video_view_count',
 'video_like_count',
 'video_share_count',
 'video_download_count',
 'video_comment_count']

Since our interest is in verification status and video view count, we can compare the mean view count for each category.

In [8]:
data.groupby('verified_status')['video_view_count'].mean().round(2)

verified_status
not verified    265663.79
verified         91439.16
Name: video_view_count, dtype: float64

### Hypothesis Testing

**Null hypothesis:** There is no difference in the mean number of views between TikTok videos posted by verified accounts and those posted by unverified accounts (any observed difference in the sample data is due to chance or sampling variability).
<br/>
**Alternative hypothesis:** There is a difference in the mean number of views between TikTok videos posted by verified accounts and those posted by unverified accounts (any observed difference in the sample data is due to an actual difference in the corresponding population means).
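
Stated formally, with $\mu_{\text{verified}}$ and $\mu_{\text{not verified}}$ as assumed notation for the population mean view counts of the two groups, the hypotheses can be written as:

$$
H_0: \mu_{\text{not verified}} = \mu_{\text{verified}}
\qquad
H_A: \mu_{\text{not verified}} \neq \mu_{\text{verified}}
$$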

In [9]:
# Conducting a two-sample t-test to compare means

not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]

# Welch's t-test: equal_var=False does not assume equal variances in the two groups
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)

TtestResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120, df=1571.163074387424)
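
With `equal_var=False`, the `statistic` and `df` fields correspond to Welch's t statistic and the Welch-Satterthwaite degrees of freedom. The sketch below shows how those two quantities are computed, using only the standard library. The helper `welch_t` and the toy samples are hypothetical and are not taken from the TikTok dataset.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Sample variances (n - 1 in the denominator), each divided by its sample size
    se_a = statistics.variance(a) / n_a
    se_b = statistics.variance(b) / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical toy samples for illustration only
group_a = [10.0, 12.0, 14.0, 16.0, 18.0]
group_b = [1.0, 2.0, 3.0, 4.0, 5.0]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

Because the degrees of freedom depend on both variances and both sample sizes, they are generally not an integer, which is why the output above reports a fractional `df`.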

Significance level: 0.05. The result is statistically significant if the p-value is below 0.05.
* The p-value (about 2.61e-120) is far below 0.05, so we reject the null hypothesis.
* Conclusion: There is a statistically significant difference in the mean number of views between TikTok videos posted by verified accounts and those posted by unverified accounts.
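
The decision rule itself can be sketched in a few lines. The p-value is copied from the `TtestResult` output above, and the variable names are illustrative.

```python
alpha = 0.05  # significance level used in this analysis
p_value = 2.6088823687177823e-120  # p-value reported by stats.ttest_ind

# Reject the null hypothesis only when the p-value falls below alpha
reject_null = p_value < alpha
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```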

Further analysis would be useful to identify whether this phenomenon is influenced by user behavior.