**The main goal of this kernel is to make the results comprehensible**. If we create a plot showing the most frequent words in the news but cannot understand them because they are in Chinese, then our work has been for nothing. Remember: the main goal of data analysis is to extract useful conclusions from the data.

I know a little Chinese so that's why this dataset interested me. 

# 1-Inspection of the dataset

In [None]:
#libraries I'm going to use
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go

In [None]:
dataset=pd.read_csv('../input/chinese-official-daily-news-since-2016/chinese_news.csv')

In [None]:
dataset.head()

The columns of the dataset are:

* Date. When the news item was published.

* Tag. The topic of the news item.

* Headline. The headline.

* Content. The full text of the news item.

Let's check the possible topics the news can belong to.

In [None]:
dataset.tag.unique()

* 国内 (guónèi): these characters together literally mean 'inside country'. The correct translation is 'domestic'.
* 国际 (guójì): this means 'international'.
* 详细全文 (xiángxì quánwén): this means 'full text'.

In [None]:
#I'm going to replace the tags by their English version
dataset['tag'] = dataset['tag'].str.replace('国内', 'domestic news')
dataset['tag'] = dataset['tag'].str.replace('国际', 'international news')
dataset['tag'] = dataset['tag'].str.replace('详细全文', 'detailed news')

#and create a new column year
dataset['year'] = pd.DatetimeIndex(dataset['date']).year
#change the type of the column to string
dataset['year'] = dataset['year'].apply(str)
#unique values of the new column 'year'
print(dataset['year'].unique())
#make a new dataset counting the number of news by year
d2=dataset.groupby('year').size().reset_index(name='count')

In [None]:
fig = px.bar(d2,x='year',y='count',title='Amount of news by year',color='year',color_discrete_map={'2016': '#e8e3cc', 
                                                   '2017': '#d7a449', '2018': '#db3f29'})

fig.update_layout(
   paper_bgcolor='#0b1f65',
   plot_bgcolor='#0b1f65',
    font_family="Arial",
    font_color="white",
    title_font_family="Arial",
    title_font_color="white",
    legend_title_font_color="white",
    xaxis = { 
    'showgrid': False, 
    'zeroline': True, 
    'visible': True,
    'tickformat': 'd'
    },
    yaxis = { 
    'showgrid': False, 
    'zeroline': True, 
    'visible': True,
    'title': 'Amount of news'
    }
    
)




# plot
fig.show()
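
The layout settings above are repeated for every figure in this kernel. One possible way to avoid the repetition, sketched here with a hypothetical helper named `dark_layout` (it is not part of plotly), is to build the shared settings once as a dictionary:

```python
def dark_layout(y_title=None):
    """Return the shared dark-blue layout settings as a plain dictionary.

    Hypothetical helper for illustration; the keys mirror the ones
    passed to fig.update_layout in this kernel.
    """
    # settings common to both axes
    axis = {'showgrid': False, 'zeroline': True, 'visible': True}
    yaxis = dict(axis)
    if y_title is not None:
        yaxis['title'] = y_title
    return {
        'paper_bgcolor': '#0b1f65',
        'plot_bgcolor': '#0b1f65',
        'font_family': 'Arial',
        'font_color': 'white',
        'title_font_family': 'Arial',
        'title_font_color': 'white',
        'legend_title_font_color': 'white',
        'xaxis': dict(axis),
        'yaxis': yaxis,
    }

layout = dark_layout('Amount of news')
print(layout['yaxis'])
```

The result could then be passed to a figure as `fig.update_layout(**dark_layout('Amount of news'))`. The `tickformat` entry used on some x axes is left out of this sketch.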

In [None]:
#Amounts of news by tag
d3=dataset.groupby('tag').size().reset_index(name='count')


fig = px.bar(d3,x='tag',y='count',title='News by tag',color='tag',color_discrete_map={'detailed news': '#e8e3cc', 
                                                   'domestic news': '#d7a449', 'international news': '#db3f29'})

fig.update_layout(
   paper_bgcolor='#0b1f65',
   plot_bgcolor='#0b1f65',
    font_family="Arial",
    font_color="white",
    title_font_family="Arial",
    title_font_color="white",
    legend_title_font_color="white",
    xaxis = { 
    'showgrid': False, 
    'zeroline': True, 
    'visible': True,
    'tickformat': 'd'
    },
    yaxis = { 
    'showgrid': False, 
    'zeroline': True, 
    'visible': True,
    'title': 'Amount of news'
    }
    
)




# plot
fig.show()




In [None]:
#news by month
dataset['month'] = pd.DatetimeIndex(dataset['date']).month
dataset['month'] = dataset['month'].apply(str)
d4=dataset.groupby('month').size().reset_index(name='count')

colors = ['crimson',] * 12

fig = go.Figure(data=[go.Bar(
    x=d4.month,
    y=d4['count'],
    marker_color=colors
)])

fig.update_layout(
   paper_bgcolor='#0b1f65',
   plot_bgcolor='#0b1f65',
    font_family="Arial",
    font_color="white",
    title_font_family="Arial",
    title_font_color="white",
    legend_title_font_color="white",
    xaxis = { 
    'showgrid': False, 
    'zeroline': True, 
    'visible': True,
    
    },
    yaxis = { 
    'showgrid': False, 
    'zeroline': True, 
    'visible': True,
    'title': 'Amount of news'
    }
    
)




# plot
fig.show()

More news items were published in May, August, September and March than in the other months.

# 2-Analysis of the text



In [None]:
pip install -U spacy

In [None]:
import spacy
from spacy import displacy
from spacy.lang.zh import Chinese
# Disable jieba to use character segmentation
Chinese.Defaults.use_jieba = False
nlp = Chinese()

# Disable jieba through tokenizer config options
cfg = {"use_jieba": False}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})
# Load with "default" model provided by pkuseg
cfg = {"pkuseg_model": "default", "require_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})

In [None]:
#filter the dataset to get only the international news
international=dataset[dataset['tag']=='international news']
international = international.reset_index(drop=True)

In [None]:
nlp = Chinese()
d1=nlp(international['headline'][3])
d1

In [None]:
tokenized_text = pd.DataFrame()
#describe the words in the sentence before
for i, token in enumerate(d1):
    tokenized_text.loc[i, 'text'] = token.text
    tokenized_text.loc[i, 'type'] = token.pos_
    tokenized_text.loc[i, 'lemma'] = token.lemma_
    tokenized_text.loc[i, 'is_alphabetic'] = token.is_alpha
    tokenized_text.loc[i, 'is_stop'] = token.is_stop
    tokenized_text.loc[i, 'is_punctuation'] = token.is_punct
    tokenized_text.loc[i, 'sentiment'] = token.sentiment
    
    

tokenized_text[:10]

Here we run into the problem of tokenizing Chinese words. Look at the first word in the sentence: 以色列 (Yǐsèliè), which means Israel. As the table above shows, the tokenizer splits this word into three characters. To write the names of foreign countries, Chinese usually combines characters that sound similar to the original name, in this case 以 (use/by/for), 色 (colour/expression) and 列 (list/rank/category). Chinese words are composed of characters, and each character has a meaning of its own. That is the big problem here: counting these tokens would count characters, not words.
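
The effect of character-level splitting on word counts can be illustrated without spaCy. This is a minimal sketch, assuming a made-up headline and a tiny hand-written vocabulary (both hypothetical, not taken from the dataset). It compares splitting by character with a simple longest-match segmentation, which is only one of several possible approaches:

```python
from collections import Counter

# Hypothetical headline and vocabulary, made up for illustration only
headline = "以色列总理访问俄罗斯"
vocabulary = {"以色列", "总理", "访问", "俄罗斯"}

def char_tokens(text):
    """Split the text into single characters."""
    return list(text)

def max_match_tokens(text, vocab):
    """Greedy forward longest-match segmentation against a known vocabulary."""
    max_len = max(len(word) for word in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # try the longest candidate first and fall back to a single character
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens

char_result = char_tokens(headline)
word_result = max_match_tokens(headline, vocabulary)

print(Counter(char_result))
print(Counter(word_result))
```

With character tokens, 以色列 never appears as a countable unit; with the vocabulary-based split it is counted as one word.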

Then, how can we count the most frequent words in the headlines?

I used Googletrans (a free Python library that implements the Google Translate API) to translate the headlines into English.



In [None]:
pip install googletrans

In [None]:
import googletrans
from googletrans import Translator

translator = Translator()
# available languages for translation
print(googletrans.LANGUAGES)

In [None]:
# translate every headline to English (the default target language of googletrans)
headlines = international['headline']
translations = []
for headline in headlines:
    translations.append(translator.translate(headline).text)
    
    


In [None]:
international['headlineEnglish'] = translations
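
Translating headline by headline sends one request per row, so repeated headlines are translated more than once and a single failed request stops the whole loop. This is a rough sketch of one way to soften both issues, assuming a hypothetical helper `translate_unique` that takes any translation function as an argument. A stand-in function is used here instead of a real translator so that the example runs offline:

```python
def translate_unique(headlines, translate_fn):
    """Translate each distinct headline once and reuse the result for repeats."""
    cache = {}
    results = []
    for headline in headlines:
        if headline not in cache:
            try:
                cache[headline] = translate_fn(headline)
            except Exception:
                # on failure keep the original text so the output stays aligned
                cache[headline] = headline
        results.append(cache[headline])
    return results

# Stand-in translator for this demonstration only (hypothetical lookup table)
fake_dictionary = {"你好": "hello", "谢谢": "thank you"}
calls = []

def fake_translate(text):
    calls.append(text)
    return fake_dictionary[text]  # raises KeyError for unknown text

demo = translate_unique(["你好", "谢谢", "你好", "再见"], fake_translate)
print(demo)
print(len(calls))
```

With the translator created earlier, `translate_fn` could be something like `lambda text: translator.translate(text).text`.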

In [None]:
import nltk
from collections import Counter

# Build a set of English stopwords for fast whole-word membership tests
# (run nltk.download('stopwords') first if the corpus is not available)
stopwords = set(nltk.corpus.stopwords.words('english'))

# Lowercase all translated headlines, join them and split them into words
words = (international.headlineEnglish
           .str.lower()
           .str.cat(sep=' ')
           .split()
        )

# Keep only the words that are not stopwords
l = [word for word in words if word not in stopwords]
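
When filtering stopwords, it matters whether the container being tested is a string or a set: `in` on a string is a substring test, while `in` on a set compares whole words. A small sketch with a made-up stopword list and word list (both hypothetical) shows the difference:

```python
# Made-up stopwords and words, for illustration only
stopwords = ["the", "in", "other"]
words = ["the", "he", "iran", "in", "r"]

# A regex-style pattern string built from the stopwords
pattern_string = r'\b(?:{})\b'.format('|'.join(stopwords))

# Substring test: drops any word that occurs anywhere inside the pattern text
kept_with_string = [w for w in words if w not in pattern_string]

# Whole-word test: drops only exact stopwords
stopword_set = set(stopwords)
kept_with_set = [w for w in words if w not in stopword_set]

print(kept_with_string)
print(kept_with_set)
```

The substring test also removes "he" (contained in "the") and "r" (contained in "other"), which are not stopwords in this list.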
        



In [None]:
mostFrequentWords = pd.DataFrame(Counter(l).most_common(40),
                    columns=['Word', 'Frequency'])

mostFrequentWords.head(10)

In [None]:
fig = px.bar(mostFrequentWords,
             x='Frequency',
             y='Word',
             title='The 40 words most mentioned in the headlines (International news)',
             color='Frequency',
             barmode='stack')

fig.update_layout(
   paper_bgcolor='#0b1f65',
   plot_bgcolor='#0b1f65',
    font_family="Arial",
    font_color="white",
    title_font_family="Arial",
    title_font_color="white",
    legend_title_font_color="white",
    xaxis = { 
    'showgrid': False, 
    'zeroline': True, 
    'visible': True,
    
    },
    yaxis = { 
    'showgrid': False, 
    'zeroline': True, 
    'visible': True,
    
    }
    
)




# plot
fig.show()

The international headlines mostly deal with the United States, Russia, Syria, Korea, Iran, etc.

In [None]:
from wordcloud import WordCloud

# map each word to its frequency
d = dict(zip(mostFrequentWords['Word'], mostFrequentWords['Frequency']))

wordcloud = WordCloud(background_color='#e8e3cc', max_font_size=86, random_state=42)
wordcloud.generate_from_frequencies(frequencies=d)

# draw the word cloud without axes
plt.figure(figsize=[12, 8])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()