# Data Cleaning and Basic Exploration v2

20190507 Update: px.bar now supports barmode="group"

20190424 Update: Replace `langua` with a simpler `re`-based language detector.

Updates:

1. Use the new Plotly Express API
2. Use `langua` to detect language to weed out problematic entries

## Table of Contents

1. [Imports](#Imports)
2. [Reading in Data](#Reading-in-Data)
3. [Cleaning the Data](#Cleaning-the-Data)
  - [Deduplicate](#Deduplicate)
  - [Remove entries that are not written in Chinese](#Remove-entries-that-are-not-written-in-Chinese)
4. [Simple Visualizations](#Simple-Visualizations)
5. [Check the Mentioning of People in the Titles](#Check-the-Mentioning-of-People-in-the-Titles)
  - [Trump](#Trump)
  - [Xi Jinping](#Xi-Jinping)
6. [The End](#The-End)

## Imports

In [None]:
from pathlib import Path

import pandas as pd
import plotly_express as px

INPUT_FOLDER = Path("../input/")

## Reading in Data

In [None]:
df = pd.read_csv(INPUT_FOLDER / "news_collection.csv", parse_dates=["date"])
df.sample(5)

In [None]:
f"Number of entries: {df.shape[0]:,d}"

## Cleaning the Data

### Deduplicate

There are some duplicates in the dataset:

In [None]:
df[df.duplicated(["title", "desc", "image", "url", "source"], keep=False)].sort_values(["url", "date"]).head()

Examine the most frequent entries:

In [None]:
df[
    df.duplicated(["title", "desc", "image", "url", "source"], keep=False)
].groupby(["title", "url", "source"]).size().to_frame("cnt").sort_values("cnt", ascending=False).head()

We see that some entries are not really news, but links to websites, Instagram accounts, YouTube, and online surveys.

In [None]:
blacklists = [
    "https://www.sgsme.sg/", "https://www.voachinese.com/", "https://www.voachinese.com/z/5102", 
    "https://www.instagram.com/voachinese/", "https://cn.wsj.com/", "https://www.wsj.com/europe"
]

Deduplicate:

In [None]:
df = df[~df.duplicated(["title", "desc", "image", "url", "source"], keep="first")]
df.shape[0]
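The difference between `keep=False` (used above to inspect every copy) and `keep="first"` (used to drop the extra copies) is easy to miss. A minimal sketch on a made-up three-row frame, with column names chosen only for illustration:

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 are exact duplicates.
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "url": ["u1", "u1", "u2"],
})

# keep=False flags every member of a duplicate group (useful for inspection).
all_copies = toy.duplicated(["title", "url"], keep=False)

# keep="first" flags only the later copies, so ~mask keeps one row per group.
extra_copies = toy.duplicated(["title", "url"], keep="first")
deduped = toy[~extra_copies]

print(all_copies.tolist())    # [True, True, False]
print(extra_copies.tolist())  # [False, True, False]
print(len(deduped))           # 2
```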

Remove those in the blacklist:

In [None]:
df = df[~df.url.isin(blacklists)]
df.shape[0]

Remove YouTube videos:

In [None]:
df = df[~df.url.str.startswith("https://www.youtube.com")]
df.shape[0]
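The prefix check above only matches URLs that begin with exactly `https://www.youtube.com`. If the feed also contains variants such as `http://` links or `youtu.be` short links (an assumption, not something verified against this dataset), comparing the host name may be more robust. A sketch using only the standard library; `is_youtube` and the host set are hypothetical:

```python
from urllib.parse import urlparse

# Hosts treated as YouTube (an assumed list; adjust to what the data contains).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def is_youtube(url):
    """Return True if the URL's host is one of the assumed YouTube hosts."""
    host = urlparse(str(url)).hostname or ""
    return host.lower() in YOUTUBE_HOSTS


print(is_youtube("https://www.youtube.com/watch?v=abc"))  # True
print(is_youtube("http://youtu.be/abc"))                  # True
print(is_youtube("https://www.voachinese.com/a/1.html"))  # False
```

With such a helper, the filter could be written as `df = df[~df.url.apply(is_youtube)]`.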

### Remove entries that are not written in Chinese

The news feeds can be quite noisy at times (especially RFI.fr). We need to find the non-Chinese entries that were put into the feeds by mistake and remove them.

In [None]:
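import re


def cjk_detect(texts):
    """Return a rough language tag based on which CJK script appears in the text."""
    texts = str(texts)
    # Korean: Hangul syllables
    if re.search("[\uac00-\ud7a3]", texts):
        return "ko"
    # Japanese: hiragana and katakana (checked before Chinese, since Japanese also uses kanji)
    if re.search("[\u3040-\u30ff]", texts):
        return "ja"
    # Chinese: CJK unified ideographs
    if re.search("[\u4e00-\u9fff]", texts):
        return "zh"
    return "others"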
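The order of the checks matters: Japanese text usually mixes kana with kanji, and kanji fall inside the same `\u4e00-\u9fff` range used for Chinese, so the kana test has to run before the Chinese one. A small self-contained check on made-up strings (the function is repeated so the snippet runs on its own):

```python
import re


def cjk_detect(texts):
    texts = str(texts)
    if re.search("[\uac00-\ud7a3]", texts):  # Hangul syllables
        return "ko"
    if re.search("[\u3040-\u30ff]", texts):  # hiragana and katakana
        return "ja"
    if re.search("[\u4e00-\u9fff]", texts):  # CJK unified ideographs
        return "zh"
    return "others"


# Keys are the expected tags, values are invented sample strings.
samples = {
    "ko": "한국어 뉴스",
    "ja": "日本語のニュース",  # contains both kanji and kana
    "zh": "中文新聞",
    "others": "Breaking news",
}
results = {expected: cjk_detect(text) for expected, text in samples.items()}
print(results)
```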

In [None]:
%%time
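# Detect on the text fields only, rather than on the string form of the whole row.
text = df["title"].fillna("") + " " + df["desc"].fillna("")
df["lang"] = text.apply(cjk_detect)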

In [None]:
df["lang"].value_counts()

In [None]:
df[df["lang"]=="ko"].head()

In [None]:
df[df["lang"]=="others"].sample(5)

Looks like we can include Japanese results:

In [None]:
df[df["lang"]=="ja"].sample(5)

In [None]:
df[df["lang"]=="zh"].sample(5)

In [None]:
print("Before:", df.shape[0])
df = df[df.lang.isin(("ja", "zh"))].copy()
print("After:", df.shape[0])

## Simple Visualizations

In [None]:
# Name the columns explicitly so the code does not depend on the pandas version's defaults.
source_counts = df.source.value_counts().rename_axis("Source").reset_index(name="Count")

px.bar(
    source_counts, x="Source", y="Count", template="plotly_white",
    labels=dict(Count="Number of Entries"),
    width=800, height=400, title="# of News Entries by Source"
)

In [None]:
# Name the columns explicitly so the code does not depend on the pandas version's defaults.
date_counts = df.date.value_counts().rename_axis("Date").reset_index(name="Count")
date_counts["Date"] = date_counts["Date"].dt.strftime("%Y-%m-%d")

px.bar(
    date_counts, x="Date", y="Count", template="plotly_white",
    labels=dict(Count="Number of Entries"),
    width=800, height=400, title="Article Counts by Date"
)

In [None]:
# Same per-date count as above, restricted to a single source.
date_counts = df[df.source == "NYTimes"].date.value_counts().rename_axis("Date").reset_index(name="Count")
date_counts["Date"] = date_counts["Date"].dt.strftime("%Y-%m-%d")

px.bar(
    date_counts, x="Date", y="Count", template="plotly_white",
    labels=dict(Count="Number of Entries"),
    width=800, height=400, title="New York Times (CN) Article Counts by Date"
)

## Check the Mentioning of People in the Titles

### Trump

Note: this dataset does not guarantee full coverage, so the following section is not necessarily an accurate account of the coverage of Trump and Xi. It just serves as a simple example.

In [None]:
df["trump"] = (
    df.title.str.contains("川普") |
    df.title.str.contains("特朗普")
)
f'% of titles mentioning Trump: {df["trump"].sum() / df.shape[0] * 100:.2f}%'
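Because `str.contains` treats its pattern as a regular expression by default, the two name variants can also be matched in a single call with `|`. A minimal sketch on made-up titles; `na=False` is added here so that missing titles count as non-matches, which is an assumption about how they should be handled:

```python
import pandas as pd

# Toy titles (hypothetical), including a missing value.
titles = pd.Series(["川普發表演說", "特朗普访问日本", "股市收高", None])

# One regex with alternation covers both transliterations of "Trump".
mentions = titles.str.contains("川普|特朗普", na=False)

print(mentions.tolist())                # [True, True, False, False]
print(f"{mentions.mean() * 100:.2f}%")  # 50.00%
```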

Percentage of titles mentioning Trump by source:

In [None]:
trump_perc_by_source = df.groupby("source")["trump"].mean().sort_values() * 100
trump_perc_by_source = trump_perc_by_source.to_frame("Perc").reset_index()

px.bar(
    trump_perc_by_source, x="Perc", y="source", template="plotly_white",
    labels=dict(Perc="%", source="Source"), 
    width=800, height=400, title="Percentage of Titles mentioning Trump by Sources",
    orientation="h"
)
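The percentage above relies on the fact that the mean of a boolean column is the share of `True` values. A toy illustration with invented sources and flags:

```python
import pandas as pd

# Toy data (hypothetical): source A mentions Trump in 1 of 4 titles, source B in 2 of 2.
toy = pd.DataFrame({
    "source": ["A", "A", "A", "A", "B", "B"],
    "trump": [True, False, False, False, True, True],
})

# Mean of booleans = fraction of True; multiply by 100 for a percentage.
perc = toy.groupby("source")["trump"].mean().sort_values() * 100
perc = perc.to_frame("Perc").reset_index()

print(perc)
```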

In [None]:
trump_perc_by_date = df.groupby("date")["trump"].mean().sort_values() * 100
trump_perc_by_date = trump_perc_by_date.to_frame("Perc").reset_index()
trump_perc_by_date["date"] = trump_perc_by_date["date"].dt.strftime("%Y-%m-%d")

px.bar(
    trump_perc_by_date, x="date", y="Perc", template="plotly_white",
    labels=dict(Perc="%", date="Date"), 
    width=800, height=400, title="Percentage of Titles mentioning Trump by Date",
    orientation="v"
)

### Xi Jinping

Now we repeat the same process with Chinese president Xi:

In [None]:
df["xi"] = (
    df.title.str.contains("習近平") |
    df.title.str.contains("习近平")
)
f'% of titles mentioning Xi: {df["xi"].sum() / df.shape[0] * 100:.2f}%'

In [None]:
xi_perc_by_source = df.groupby("source")["xi"].mean().sort_values() * 100
xi_perc_by_source = xi_perc_by_source.to_frame("Perc").reset_index()

px.bar(
    xi_perc_by_source, x="Perc", y="source", template="plotly_white",
    labels=dict(Perc="%", source="Source"), 
    width=800, height=400, title="Percentage of Titles mentioning Xi by Sources",
    orientation="h"
)

In [None]:
xi_perc_by_date = df.groupby("date")["xi"].mean().sort_values() * 100
xi_perc_by_date = xi_perc_by_date.to_frame("Perc").reset_index()
xi_perc_by_date["date"] = xi_perc_by_date["date"].dt.strftime("%Y-%m-%d")

px.bar(
    xi_perc_by_date, x="date", y="Perc", template="plotly_white",
    labels=dict(Perc="%", date="Date"), 
    width=800, height=400, title="Percentage of Titles mentioning Xi by Date",
    orientation="v"
)

Many titles mention Xi on January 3rd. We can take a look at what they are about:

In [None]:
df[(df.date == "2019-01-03") & df.xi][["title", "desc", "source"]].sample(5)

### Combine Two Plots

Originally there was a plot combining the mentions of Trump and Xi by source in one plot, but I'm not sure of the proper way to do it in Plotly Express.

One way might be to combine the two data frames and use the `color` parameter to distinguish one from the other:

In [None]:
xi_perc_by_source["poi"] = "Xi"
trump_perc_by_source["poi"] = "Trump"
combined = pd.concat([trump_perc_by_source, xi_perc_by_source], axis=0)
px.bar(
    combined, x="Perc", y="source", color="poi", template="plotly_white",
    labels=dict(Perc="%", source="Source"),
    width=800, height=400, title="Percentage of Title Mentions by Sources",
    orientation="h", barmode="group"
)
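An alternative to concatenating the two per-source frames is to compute both percentages in one `groupby` and reshape to long form with `melt`, which yields the same `source` / `poi` / `Perc` columns that `color="poi"` expects. A sketch on invented data, not the notebook's actual numbers:

```python
import pandas as pd

# Toy data (hypothetical) with one boolean flag per person of interest.
toy = pd.DataFrame({
    "source": ["A", "A", "B", "B"],
    "trump": [True, False, True, True],
    "xi": [False, False, True, False],
})

combined = (
    toy.groupby("source")[["trump", "xi"]].mean().mul(100)  # share of True per source, in %
    .rename(columns={"trump": "Trump", "xi": "Xi"})
    .reset_index()
    .melt(id_vars="source", var_name="poi", value_name="Perc")  # wide -> long
)

print(combined)
```

The resulting frame could then be passed to `px.bar` with `color="poi"` and `barmode="group"` in the same way as the concatenated one.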

Another approach uses the `facet_col` parameter:

In [None]:
xi_perc_by_source["poi"] = "Xi"
trump_perc_by_source["poi"] = "Trump"
combined = pd.concat([trump_perc_by_source, xi_perc_by_source], axis=0)
px.bar(
    combined, x="Perc", y="source", template="plotly_white",
    labels=dict(Perc="%", source="Source"), facet_col="poi",
    width=800, height=600, title="Percentage of Title Mentions by Sources",
    orientation="h"
)

Looks more reasonable, but the spacing between axis and labels still needs some work.

## The End

This is the end of this simple starter notebook. Hopefully you'll find this dataset interesting.

There are some issues that were not covered here but might be of interest to you:

1. The conversion between Traditional and Simplified Chinese.
2. The overabundance of content from some sources. That could mean a lot of short breaking-news pieces, or simply low-quality content. This might need some investigation.
3. The summary field of some sources is just a truncated version of the full article, so the final sentence is usually incomplete. You might want to remove that sentence.
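
For the third point, one possible approach is to cut each summary back to its last sentence-ending punctuation mark. A rough sketch; `trim_incomplete_sentence` and the set of terminators are assumptions, and summaries with no terminator are returned unchanged:

```python
# Sentence terminators assumed for Chinese and mixed text.
TERMINATORS = "。！？!?"


def trim_incomplete_sentence(summary):
    """Drop trailing text after the last sentence-ending mark, if there is one."""
    text = str(summary).strip()
    # Position of the last terminator of any kind (-1 if none is present).
    last = max(text.rfind(mark) for mark in TERMINATORS)
    return text[: last + 1] if last != -1 else text


print(trim_incomplete_sentence("第一句話。第二句話還沒有說"))  # 第一句話。
print(trim_incomplete_sentence("沒有標點的摘要"))              # 沒有標點的摘要
```

Applied to the dataset, this might look like `df["desc"] = df["desc"].apply(trim_incomplete_sentence)`, ideally restricted to the sources whose summaries are known to be truncated.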