### **Background**

Understanding the background of a problem makes it easier to identify its source and choose an appropriate solution. It also provides a basis for evaluating the data and making sound decisions. In this context, the background is framed by the following questions:

1. Do trending videos share the same quality and characteristics even though their attributes vary?
1. Why can increased user accessibility help manage trending videos?
1. How can a screening process help determine the suitability of uploaded content so that views and engagement increase?
1. What features can help improve the relationship between creators and their audience so that new content is positively received?
1. How strong is the relationship between the various attributes, and how does it determine the sustainability of trending videos?

The hypothesis being tested is the antithesis of the problem background: the attributes in the data have no significant correlation with one another, and any correlation that does appear is coincidental.
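One way a hypothesis of this kind could be checked is a permutation test on the correlation between two attributes. The sketch below is illustrative only and is not part of this notebook: it runs on synthetic data, and the helper name `permutation_pvalue`, the number of permutations, and the stand-in variables `views` and `likes` are all assumptions.

```python
import numpy as np

def permutation_pvalue(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    # Shuffling y breaks any real association, which simulates "no correlation".
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(y)
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= abs(observed):
            hits += 1
    # The add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)

# Synthetic stand-ins for two attributes (hypothetical values, not the dataset).
rng = np.random.default_rng(42)
views = rng.lognormal(mean=13, sigma=1, size=300)
likes = views * 0.03 * rng.lognormal(mean=0, sigma=0.3, size=300)

r, p = permutation_pvalue(views, likes)
print(f"r = {r:.3f}, p = {p:.4f}")
```

A small p-value would argue against the hypothesis that the observed correlation is a coincidence; a large one would be consistent with it.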

### **Data Understanding**

##### **Ignore Future Warnings**

In [1]:
import warnings

# ignore future warnings
warnings.simplefilter(
    action='ignore',
    category=FutureWarning
)

This dataset is provided for studying trending video statistics on YouTube, specifically for the United States region. As a first step, its contents are described in more detail to understand their characteristics.

##### **Import the Dataset into a DataFrame**

Import the dataset to be analyzed into a DataFrame using the **pandas** library.

In [2]:
import pandas as pd

pd.set_option('display.max_colwidth',30)
data = pd.read_csv('USvideos.csv')
data.head()

| | video_id | trending_date | title | channel_title | category_id | publish_time | tags | views | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | video_error_or_removed | description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2kyS6SvSYSE | 17.14.11 | WE WANT TO TALK ABOUT OUR ... | CaseyNeistat | 22 | 2017-11-13T17:13:01.000Z | SHANtell martin | 748374 | 57527 | 2966 | 15954 | https://i.ytimg.com/vi/2ky... | False | False | False | SHANTELL'S CHANNEL - https... |
| 1 | 1ZAPwfrtAFY | 17.14.11 | The Trump Presidency: Last... | LastWeekTonight | 24 | 2017-11-13T07:30:00.000Z | last week tonight trump pr... | 2418783 | 97185 | 6146 | 12703 | https://i.ytimg.com/vi/1ZA... | False | False | False | One year after the preside... |
| 2 | 5qpjK5DgCt4 | 17.14.11 | Racist Superman \| Rudy Man... | Rudy Mancuso | 23 | 2017-11-12T19:05:24.000Z | racist superman\|"rudy"\|"ma... | 3191434 | 146033 | 5339 | 8181 | https://i.ytimg.com/vi/5qp... | False | False | False | WATCH MY PREVIOUS VIDEO ▶ ... |
| 3 | puqaWrEC7tY | 17.14.11 | Nickelback Lyrics: Real or... | Good Mythical Morning | 24 | 2017-11-13T11:00:04.000Z | rhett and link\|"gmm"\|"good... | 343168 | 10172 | 666 | 2146 | https://i.ytimg.com/vi/puq... | False | False | False | Today we find out if Link ... |
| 4 | d380meD0W0M | 17.14.11 | I Dare You: GOING BALD!? | nigahiga | 24 | 2017-11-12T18:01:41.000Z | ryan\|"higa"\|"higatv"\|"niga... | 2095731 | 132235 | 1989 | 17518 | https://i.ytimg.com/vi/d38... | False | False | False | I know it's been a while s... |


Each row of the dataset consists of 16 columns, which contain the following information:

| Column | Description |
| --- | ---  |
| `video_id` | *unique identifier* of each video on **YouTube** |
| `trending_date` | the date on which the video was trending |
| `title` | the title of the video, which is *mandatory* |
| `channel_title` | the name of the channel on which the video was published |
| `category_id` | the category of the video, stored as a number but nominal in nature |
| `publish_time` | the date and time when the video was published |
| `tags` | words or phrases that give context to the video |
| `views` | the number of times the video was played |
| `likes` | the number of users who gave positive *feedback* by pressing the *like* button |
| `dislikes` | the number of users who gave negative *feedback* by pressing the *dislike* button |
| `comment_count` | the number of comments left by users on the video |
| `thumbnail_link` | link to the image that represents the video and entices potential viewers to play it |
| `comments_disabled` | a flag set by the video owner so that viewers cannot comment on the video |
| `ratings_disabled` | a flag set by the video owner so that viewers cannot give any *feedback*, such as *likes* and *dislikes* |
| `video_error_or_removed` | a flag indicating that the video cannot be played back |
| `description` | the description text that accompanies the video |

##### **Data Infographics**

The data infographic summarizes the number of rows and columns in the dataset and the data type of each column. Within a column, all values must share the same data type.

In [3]:
import numpy as np

print(
    'Dataframe Infographic: \n\n1. Dataframe has {} rows and {} columns.'.format(data.shape[0],data.shape[1]),
    '\n2. There are {} data types, those are {}.'.format(
        len(set(str(i) for i in data.dtypes.values)),
        ', '.join(list(set(str(i) for i in data.dtypes.values)))
    )
)

pd.DataFrame(data.dtypes).reset_index().rename(columns={
    'index':'Column',
    0:'Data Type'
}).set_index(np.r_[1:17])

Dataframe Infographic: 

1. Dataframe has 40949 rows and 16 columns. 
2. There are 3 data types, those are object, bool, int64.


| | Column | Data Type |
| --- | --- | --- |
| 1 | video_id | object |
| 2 | trending_date | object |
| 3 | title | object |
| 4 | channel_title | object |
| 5 | category_id | int64 |
| 6 | publish_time | object |
| 7 | tags | object |
| 8 | views | int64 |
| 9 | likes | int64 |
| 10 | dislikes | int64 |


##### **Descriptive statistics**

Descriptive statistics summarize and organize the characteristics of a dataset. For numeric data, they report the count, measures of central tendency, and measures of dispersion. For non-numeric data, they report only the most frequent value and how often it occurs.

The following are descriptive statistics for each numeric column.

In [4]:
numeric = data.describe()
# format the mean and std rows (positions 1 and 2) to two decimal places
numeric.iloc[1:3] = numeric.iloc[1:3].apply(lambda row: row.map('{:.2f}'.format),axis=1)
numeric

| | category_id | views | likes | dislikes | comment_count |
| --- | --- | --- | --- | --- | --- |
| count | 40949.0 | 40949.0 | 40949.0 | 40949.0 | 40949.0 |
| mean | 19.97 | 2360784.64 | 74266.7 | 3711.4 | 8446.8 |
| std | 7.57 | 7394113.76 | 228885.34 | 29029.71 | 37430.49 |
| min | 1.0 | 549.0 | 0.0 | 0.0 | 0.0 |
| 25% | 17.0 | 242329.0 | 5424.0 | 202.0 | 614.0 |
| 50% | 24.0 | 681861.0 | 18091.0 | 631.0 | 1856.0 |
| 75% | 25.0 | 1823157.0 | 55417.0 | 1938.0 | 5755.0 |
| max | 43.0 | 225211923.0 | 5613827.0 | 1674420.0 | 1361580.0 |


The following are descriptive statistics for the object and boolean columns.

In [5]:
pd.set_option('display.max_colwidth',30)

display(
    data.describe(include='object').loc[['unique','top','freq']],
    data.describe(include='bool').loc[['top','freq']]
)

| | video_id | trending_date | title | channel_title | publish_time | tags | thumbnail_link | description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| unique | 6351 | 205 | 6455 | 2207 | 6269 | 6055 | 6352 | 6901 |
| top | j4KvrAUjn6c | 17.14.11 | WE MADE OUR MOM CRY...HER ... | ESPN | 2018-05-18T14:00:04.000Z | [none] | https://i.ytimg.com/vi/j4K... | ► Listen LIVE: http://powe... |
| freq | 30 | 200 | 30 | 203 | 50 | 1535 | 30 | 58 |


| | comments_disabled | ratings_disabled | video_error_or_removed |
| --- | --- | --- | --- |
| top | False | False | False |
| freq | 40316 | 40780 | 40926 |


The descriptive statistics give only a brief overview of the dataset. A Data Preparation step is still needed before the data can serve as material for analysis.

##### **Unique Data**

Listing the unique values in each column gives a basic picture of what the column contains. It reveals characteristics such as data uniformity, writing patterns, and even data types. In some cases, the data format must be restructured to meet the requirements of the analysis.

In [6]:
items = []

for col in data.columns:
    items.append([col,data[col].nunique(),', '.join(data[col].unique()[0:5].astype(str))])

pd.set_option('display.max_colwidth',100)
pd.DataFrame(items,columns=[
    'Column','Unique Data Counts','5 Examples of Unique Data'
])

| | Column | Unique Data Counts | 5 Examples of Unique Data |
| --- | --- | --- | --- |
| 0 | video_id | 6351 | 2kyS6SvSYSE, 1ZAPwfrtAFY, 5qpjK5DgCt4, puqaWrEC7tY, d380meD0W0M |
| 1 | trending_date | 205 | 17.14.11, 17.15.11, 17.16.11, 17.17.11, 17.18.11 |
| 2 | title | 6455 | WE WANT TO TALK ABOUT OUR MARRIAGE, The Trump Presidency: Last Week Tonight with John Oliver (HB... |
| 3 | channel_title | 2207 | CaseyNeistat, LastWeekTonight, Rudy Mancuso, Good Mythical Morning, nigahiga |
| 4 | category_id | 16 | 22, 24, 23, 28, 1 |
| 5 | publish_time | 6269 | 2017-11-13T17:13:01.000Z, 2017-11-13T07:30:00.000Z, 2017-11-12T19:05:24.000Z, 2017-11-13T11:00:0... |
| 6 | tags | 6055 | SHANtell martin, last week tonight trump presidency\|"last week tonight donald trump"\|"john olive... |
| 7 | views | 40478 | 748374, 2418783, 3191434, 343168, 2095731 |
| 8 | likes | 29850 | 57527, 97185, 146033, 10172, 132235 |
| 9 | dislikes | 8516 | 2966, 6146, 5339, 666, 1989 |


From this we find that:
- The `trending_date` and `publish_time` columns must be converted to a date format so that they are uniform.

- The values in the `comments_disabled`, `ratings_disabled`, and `video_error_or_removed` columns must be translated so that their meaning is conveyed clearly.

- The `video_id`, `title`, `channel_title`, `tags`, `thumbnail_link`, and `description` columns are of type `object`, so their contents are not written uniformly, and some of them relate to the URL of the video. This needs to be examined further in the Data Preparation section.

- The numeric columns, namely `category_id`, `views`, `likes`, `dislikes`, and `comment_count`, are all stored as `int64`. Although `category_id` is stored as a number, this does not cause any problems because it only represents nominal categorical data. If needed, it can be replaced with more representative values during Data Preparation.
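The adjustments listed above could be sketched as follows. This is a minimal illustration on a small hand-made sample, not the notebook's actual Data Preparation code: the `sample` frame and the label texts `enabled` and `disabled` are assumptions, and the `%y.%d.%m` format is inferred from values such as `17.14.11`.

```python
import pandas as pd

# Small stand-in sample that mirrors the column formats shown above.
sample = pd.DataFrame({
    'trending_date': ['17.14.11', '17.15.11'],
    'publish_time': ['2017-11-13T17:13:01.000Z', '2017-11-12T19:05:24.000Z'],
    'category_id': [22, 24],
    'comments_disabled': [False, True],
})

# trending_date appears to be written as YY.DD.MM, so the format is given explicitly.
sample['trending_date'] = pd.to_datetime(sample['trending_date'], format='%y.%d.%m')

# publish_time is an ISO 8601 timestamp with a trailing Z, which denotes UTC.
sample['publish_time'] = pd.to_datetime(sample['publish_time'], utc=True)

# Replace the boolean flag with a readable label (the label text is an assumption).
sample['comments_disabled'] = sample['comments_disabled'].map(
    {True: 'disabled', False: 'enabled'}
)

# Mark category_id as nominal so that it is not treated as a quantity.
sample['category_id'] = sample['category_id'].astype('category')

print(sample.dtypes)
```

Passing the format explicitly avoids relying on pandas to guess whether `14` is a day or a month in `trending_date`.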