# Data Cleaning and Basic Exploration

**April 9, 2019 Update:**

1. Use the new [plotly theme](https://medium.com/@plotlygraphs/introducing-plotly-py-theming-b644109ac9c7) (plotly_white).
2. Explicitly set the width and height properties of the plots.
3. Add [a new notebook ([YACND] Starter Notebook v2)](https://www.kaggle.com/ceshine/yacnd-starter-notebook-v2) that uses Plotly Express and experiments with language detection.

In [1]:
import pandas as pd
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)

## Reading in Data

In [2]:
df = pd.read_csv("../input/news_collection.csv", parse_dates=["date"])
df.sample(5)

Unnamed: 0,title,desc,image,url,source,date
8868,《748施行法》草案刻意排除「共同收養」規定，憲法能夠允許嗎？,儘管《748施行法》草案刻意排除有關共同收養之規定，這是政院的主觀意圖，但本草案立法目的──...,https://image6.thenewslens.com/2019/2/rypx9txi...,https://www.thenewslens.com/article/114708,NewsLens,2019-03-04
19755,台中水湳會展中心動工 盧秀燕盼帶動經濟,台中市水湳國際會展中心今天開工動土，台中市長盧秀燕表示，未來台中的企業將可以在自己的故鄉辦展...,https://cdn.taronews.tw/files/2018/03/09.png,https://taronews.tw/2019/03/23/289327/,芋傳媒,2019-03-24
37942,花博危機暫時解除 山手線之後 盧秀燕、林佳龍再度過招,在媒體報導台中花博恐因台中新任市長盧秀燕免費政策導致主辦權遭取消後，台中市府今（,https://www.cmmedia.com.tw/file/23346/15489943...,https://www.cmmedia.com.tw/home/articles/14156,信傳媒,2019-02-01
24617,美軍再空襲索馬利亞青年黨 殺死26名聖戰士,美國軍方近期加強對索馬利亞青年黨（Al-Shabaab）好戰分子的空襲行動。美軍官員今天表示...,https://cdn.taronews.tw/files/2019/03/20190301...,https://taronews.tw/2019/03/02/268847/,芋傳媒,2019-03-03
9116,是少子化，還是眼中釘：創立22年的世新大學社發所「被停招」,創立至今22年，培育無數社會工作者的世新大學社發所在現有學生畢業後恐將走入歷史，你我為什麼要...,https://image6.thenewslens.com/2019/1/1bggudbo...,https://www.thenewslens.com/article/111440,NewsLens,2019-01-06


In [3]:
f"Number of entries: {df.shape[0]:,d}"

'Number of entries: 55,175'

## Cleaning the Data

There are some duplicates in the dataset:

In [4]:
df[df.duplicated(["title", "desc", "image", "url", "source"], keep=False)].sort_values(["url", "date"]).head()

Unnamed: 0,title,desc,image,url,source,date
1649,新加坡前传 Treasures Before Us,被遗忘的新加坡700年历史，在数以万计的文物中苏醒。,//interactive.zaobao.com/2019/treasures-before...,http://interactive.zaobao.com/2019/treasures-b...,Zaobao,2019-01-28
39298,新加坡前传 Treasures Before Us,被遗忘的新加坡700年历史，在数以万计的文物中苏醒。,//interactive.zaobao.com/2019/treasures-before...,http://interactive.zaobao.com/2019/treasures-b...,Zaobao,2019-01-30
200,从高通苹果专利战看中国忽略的三件事,刘远举：知识产权更重要的是执法，而不在于惩罚侵权者。用道德档案进行间接威慑的方式，是舍本逐末...,http://i.ftimg.net/picture/9/000073199_piclink...,http://www.ftchinese.com/story/001080669?full=...,FT,2018-12-14
38794,从高通苹果专利战看中国忽略的三件事,刘远举：知识产权更重要的是执法，而不在于惩罚侵权者。用道德档案进行间接威慑的方式，是舍本逐末...,http://i.ftimg.net/picture/9/000073199_piclink...,http://www.ftchinese.com/story/001080669?full=...,FT,2018-12-15
15259,用法治之战化解华为危机,盛洪：说孟晚舟案应是一场法治之战，不仅指在加拿大或美国法庭上的诉辨对抗，还是指两个法律体系之...,http://i.ftimg.net/picture/0/000063990_piclink...,http://www.ftchinese.com/story/001080753,FT,2018-12-24


Examine the most frequent entries:

In [5]:
df[
    df.duplicated(["title", "desc", "image", "url", "source"], keep=False)
].groupby(["title", "url", "source"]).size().to_frame("cnt").sort_values("cnt", ascending=False).head()

title,url,source,cnt
SGSME.SG,https://www.sgsme.sg/,Zaobao,93
"The Wall Street Journal & Breaking News, Business, Financial and Economic News, World News and Video",https://www.wsj.com/europe,WSJ,33
觸摸時代的政經脈搏，我們想成為你觀察時代的可靠伙伴。,https://membership.theinitium.com/,Initium,33
美国之音中文网 您可靠的信息来源,https://www.voachinese.com/,VOA,33
春来早报喜2019－诸事如意,https://www.zaobao.com.sg/special/report/singapore/chinese-new-year-2019?utm_source=facebook&utm_medium=social-organic&utm_campaign=cny2019&utm_content=end-text,Zaobao,15


We see that some entries are not really news articles, but links to websites, Instagram accounts, YouTube videos, and online surveys.

In [6]:
blacklists = [
    "https://www.sgsme.sg/", "https://www.voachinese.com/", "https://www.voachinese.com/z/5102", 
    "https://www.instagram.com/voachinese/", "https://cn.wsj.com/"
]

Deduplicate:

In [7]:
df = df[~df.duplicated(["title", "desc", "image", "url", "source"], keep="first")]
df.shape[0]

54163
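The inspection cells above use `keep=False`, while the deduplication cell uses `keep="first"`. A minimal sketch of the difference on a toy frame (the rows below are made up for illustration and are not from the dataset):

```python
import pandas as pd

# Toy frame (made-up rows): rows 0 and 2 are identical on both columns.
toy = pd.DataFrame({
    "title": ["a", "b", "a"],
    "url": ["u1", "u2", "u1"],
})

# keep=False flags every member of a duplicate group, which is handy for inspection.
all_copies = toy.duplicated(["title", "url"], keep=False)

# keep="first" flags only the later copies, so negating it keeps one row per group.
later_copies = toy.duplicated(["title", "url"], keep="first")
deduped = toy[~later_copies]

print(all_copies.tolist())    # [True, False, True]
print(later_copies.tolist())  # [False, False, True]
print(len(deduped))           # 2
```

On this toy frame the result matches `toy.drop_duplicates(subset=["title", "url"])`, which is a more compact way to express the same step.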

Remove those in the blacklist:

In [8]:
df = df[~df.url.isin(blacklists)]
df.shape[0]

54157

Remove YouTube videos:

In [9]:
df = df[~df.url.str.startswith("https://www.youtube.com")]
df.shape[0]

54151
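The prefix check above only matches URLs that start with exactly `https://www.youtube.com`. If the dataset also contains `http://`, mobile, or `youtu.be` links (an assumption that was not verified here), matching on the hostname might be more robust. A rough sketch, where `is_youtube_url` and `YOUTUBE_HOSTS` are hypothetical names that are not part of the original notebook:

```python
from urllib.parse import urlparse

# Hypothetical helper, not part of the original notebook.
# Assumed set of YouTube domains; extend it if other hosts show up.
YOUTUBE_HOSTS = {"youtube.com", "youtu.be"}


def is_youtube_url(url):
    """Return True if the URL's host is a YouTube domain (any scheme or subdomain)."""
    if not isinstance(url, str):
        return False  # e.g. NaN for a missing URL
    host = (urlparse(url).hostname or "").lower()
    # Match the bare domain or any subdomain, but not look-alikes such as "notyoutube.com".
    return any(host == h or host.endswith("." + h) for h in YOUTUBE_HOSTS)


print(is_youtube_url("https://www.youtube.com/watch?v=abc"))         # True
print(is_youtube_url("https://www.thenewslens.com/article/114708"))  # False
```

It could be applied with `df = df[~df.url.map(is_youtube_url)]`. Because it matches more URL variants than the prefix check, the resulting row count may differ from the one shown above.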

## Simple Visualizations

In [10]:
source_counts = df.source.value_counts()

iplot(go.Figure(
    data=[
        go.Bar(
            x=source_counts.index,
            y=source_counts.values
        )
    ], 
    layout=go.Layout(
        title='Article Counts by Source',
        width=800, height=400, template="plotly_white"
    )
))

In [11]:
date_counts = df.date.value_counts()

iplot(go.Figure(
    data=[
        go.Bar(
            x=date_counts.index.strftime("%Y-%m-%d"),
            y=date_counts.values
        )
    ], 
    layout=go.Layout(
        title='Article Counts by Date',
        width=800, height=400, template="plotly_white"
    )
))

In [12]:
date_counts = df[df.source == "NYTimes"].date.value_counts()

iplot(go.Figure(
    data=[
        go.Bar(
            x=date_counts.index.strftime("%Y-%m-%d"),
            y=date_counts.values
        )
    ], 
    layout=go.Layout(
        title='New York Times (CN) Article Counts by Date',
        width=800, height=400, template="plotly_white"
    )
))

## Checking Mentions of People in the Titles

### Trump

Note: this dataset does not guarantee full coverage, so the following section is not necessarily an accurate account of the coverage of Trump and Xi. It serves only as a simple example.

In [13]:
df["trump"] = (
    df.title.str.contains("川普") |
    df.title.str.contains("特朗普")
)
f'% of titles mentioning Trump: {df["trump"].sum() / df.shape[0] * 100:.2f}%'

'% of titles mentioning Trump: 3.72%'
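The two `str.contains` calls can also be folded into one regular expression with alternation. A small sketch on a toy Series (the first two titles are made up; passing `na=False` is an extra precaution for missing titles that the original cell does not take):

```python
import pandas as pd

titles = pd.Series([
    "川普宣布新關稅",                          # made-up title using the transliteration 川普
    "特朗普与习近平通话",                      # made-up title using the transliteration 特朗普
    "台中水湳會展中心動工 盧秀燕盼帶動經濟",  # title taken from the sample shown earlier
    None,                                      # missing title
])

# "川普|特朗普" matches either transliteration; na=False treats missing titles as non-matches.
mentions_trump = titles.str.contains("川普|特朗普", regex=True, na=False)

print(mentions_trump.tolist())       # [True, True, False, False]
print(f"{mentions_trump.mean():.0%}")  # 50%
```

Taking the mean of the boolean column gives the share of matching titles directly, which is the same quantity the cell above computes with `sum() / shape[0]`.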

Percentage of titles mentioning Trump by source:

In [14]:
trump_perc_by_source = df.groupby("source")["trump"].mean().sort_values() * 100

iplot(go.Figure(
    data=[
        go.Bar(
            y=trump_perc_by_source.index,
            x=trump_perc_by_source.values,
            orientation = 'h'
        )
    ], 
    layout=go.Layout(
        title='Percentage of Titles mentioning Trump by Sources',
        xaxis=dict(
            title='%',
        ),
        width=800, height=400, template="plotly_white"
    )
))

In [15]:
# Sort by date (the index) so that the series is in chronological order
trump_perc_by_date = df.groupby("date")["trump"].mean().sort_index() * 100

iplot(go.Figure(
    data=[
        go.Bar(
            x=trump_perc_by_date.index.strftime("%Y-%m-%d"),
            y=trump_perc_by_date.values
        )
    ], 
    layout=go.Layout(
        title='Percentage of Titles mentioning Trump by Date',
        yaxis=dict(
            title='%',
        ),
        width=800, height=400, template="plotly_white"
    )
))

### Xi Jinping

Now we repeat the same process with Chinese president Xi:

In [16]:
df["xi"] = (
    df.title.str.contains("習近平") |
    df.title.str.contains("习近平")
)
f'% of titles mentioning Xi: {df["xi"].sum() / df.shape[0] * 100:.2f}%'

'% of titles mentioning Xi: 1.44%'

In [17]:
xi_perc_by_source = df.groupby("source")["xi"].mean().sort_values() * 100

iplot(go.Figure(
    data=[
        go.Bar(
            y=xi_perc_by_source.index,
            x=xi_perc_by_source.values,
            orientation = 'h'
        )
    ], 
    layout=go.Layout(
        title='Percentage of Titles mentioning Xi by Sources',
        xaxis=dict(
            title='%',
        ),
        width=800, height=400, template="plotly_white"
    )
))

We can combine the two bar charts (the readability still needs some work, though):

In [18]:
xi_perc_by_source = df.groupby("source")["xi"].mean().sort_values() * 100

iplot(go.Figure(
    data=[
        go.Bar(
            y=trump_perc_by_source.index,
            x=trump_perc_by_source.values,
            orientation = 'h',
            name="Trump"
        ),
        go.Bar(
            y=xi_perc_by_source.index,
            x=xi_perc_by_source.values,
            orientation = 'h',
            name="Xi"
        ),        
    ], 
    layout=go.Layout(
        title='Percentage of Titles mentioning Xi and Trump by Sources',
        xaxis=dict(
            title='%',
        ),
        width=800, height=600, template="plotly_white"
    )
))

In [19]:
# Sort by date (the index) so that the series is in chronological order
xi_perc_by_date = df.groupby("date")["xi"].mean().sort_index() * 100

iplot(go.Figure(
    data=[
        go.Bar(
            x=xi_perc_by_date.index.strftime("%Y-%m-%d"),
            y=xi_perc_by_date.values
        )
    ], 
    layout=go.Layout(
        title='Percentage of Titles mentioning Xi by Date',
        yaxis=dict(
            title='%',
        ),
        width=800, height=400, template="plotly_white"
    ),
))

There are a lot of titles mentioning Xi on January 3rd. We can take a look at what they are about:

In [20]:
df[(df.date == "2019-01-03") & df.xi][["title", "desc", "source"]].sample(5)

Unnamed: 0,title,desc,source
50747,4200字講稿提了47次「統一」 習近平對台談話透露的訊號...,中國國家主席習近平1月2日發表《告台灣同胞書》，說到「中國人不打中國人」，但是，,信傳媒
50613,外媒：習近平變相威嚇 著眼台灣2020總統大選「換人」,北京紀念《告台灣同胞書》40周年，昨天由國家主席習近平發表對台政策講話，提出「習五點」。不只...,NewTalk.tw
50915,习近平倡议一国两制，蔡英文回应“绝不接受”,中国国家主席习近平星期三在告台湾同胞书发表40周年纪念会上表示，一国两制是为了照顾台湾的现时...,VOA
50972,習近平提和平統一 馬英九：蔡態度讓中國感受到緊迫性,中共總書記習近平昨天在北京人民大會堂發表「告台灣同胞40週年」談話，對此，前總統馬英九今（3...,NewTalk.tw
50824,習近平「告台灣同胞書40週年」談話：對台獨舞劍，意在「中華民國」項上人頭,無論擋在習近平前面的是中華民國還是台灣，他通通都要消滅掉，因為他必須在自己老到會被政敵威脅前...,NewsLens


## The End

This is the end of this simple starter notebook. Hopefully you'll find this dataset interesting.

There are some issues that were not covered here but might be of interest to you:

1. Conversion between Traditional and Simplified Chinese.
2. The overabundance of content from some sources. That could mean a lot of short breaking-news pieces, or simply low-quality content. This might need some investigation.
3. The summary field of some sources is just a truncated version of the full article. The final sentence is usually incomplete, so you might want to remove it.
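
For the third point, one possible approach is to cut the summary back to its last sentence-ending punctuation mark. This is only a rough sketch: `drop_truncated_tail` is a hypothetical helper, the punctuation sets are assumptions rather than an exhaustive list, and the last two test strings are made up.

```python
import re

# Assumed sentence terminators (Chinese and Western); "." is left out on purpose
# because it also appears in ellipses and decimal numbers.
_TERMINATORS = "。！？!?"
# Closing quotes and brackets that may follow a terminator.
_CLOSERS = "」』”’）)》"

_COMPLETE_PART = re.compile(
    rf"(.*[{re.escape(_TERMINATORS)}][{re.escape(_CLOSERS)}]*)", flags=re.DOTALL
)
_TRAILING_ELLIPSIS = re.compile(r"(?:\.{3}|…)+\s*$")


def drop_truncated_tail(desc):
    """Drop a trailing incomplete sentence from a summary.

    If no sentence terminator is found, the text is returned with only the
    trailing ellipsis (if any) stripped.
    """
    text = _TRAILING_ELLIPSIS.sub("", desc.strip())
    match = _COMPLETE_PART.match(text)  # greedy: runs up to the last terminator
    return match.group(1) if match else text


print(drop_truncated_tail("被遗忘的新加坡700年历史，在数以万计的文物中苏醒。"))
print(drop_truncated_tail("第一句完整。第二句被截斷..."))   # 第一句完整。
print(drop_truncated_tail("他說：「不接受。」然後就"))      # 他說：「不接受。」
```

With pandas it could be applied as `df["desc"].map(drop_truncated_tail)`, after handling missing values. Summaries that contain no complete sentence are left essentially untouched, which you may or may not want.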