# Pre-processing for data mining
## Introduction
This notebook comes with an exported SQL file, data_raw.sql, which records news content from different sources.

The initial table had the following columns: `ID`, `post_author`, `post_date`, `post_date_gmt`, `post_content`, `post_title`, `post_excerpt`, `post_status`, `comment_status`, `ping_status`, `post_password`, `post_name`, `to_ping`, `pinged`, `post_modified`, `post_modified_gmt`, `post_content_filtered`, `post_parent`, `guid`, `menu_order`, `post_type`, `post_mime_type`, `comment_count`


Below are two examples of the recorded data:
- example 1:

> (6189, 0, '2015-08-27 15:28:37', '2015-08-27 15:28:37', 'DHL  : amélioration de la logistique des transports (NewsMada)', 'DHL  : amélioration de la logistique des transports (NewsMada)', 'DHL  : amélioration de la logistique des transports (NewsMada)', 'inherit', 'open', 'closed', '', 'dhl-amelioration-de-la-logistique-des-transports-newsmada-2', '', '', '2015-08-27 15:28:37', '2015-08-27 15:28:37', '', 6188, 'http://example.com/wp-content/uploads/2015/08/DHL- -amélioration-de-la-logistique-des-transports-NewsMada.png', 0, 'attachment', 'image/png', 0)

- example 2:


> (6190, 1, '2015-08-26 09:19:57', '2015-08-26 09:19:57', ' [ad_1]\r\n<br><div id=\"\"><p style=\"text-align: justify;\">En collaboration avec la région Vakinankaratra, le Centre international de recherches agronomiques pour le développement (Cirad) et l’Institut international des sciences sociales (IISS), ayant trait à la prospective territoriale et locale, l’Agence française de développement (AFD) ont organisé récemment dans la ville d’Eaux un atelier sur la prospective territoriale participative. D’après le chef de région Mandrindra Andrianjanaka, l’objectif est d’expérimenter une nouvelle approche territoriale, en plus de l’approche sectorielle dont on avait l’habitude auparavant. 25 personnes disposant à titre personnel de connaissances complémentaires en la matière ont participé à cette rencontre.</p>\n<p style=\"text-align: justify;\">Il s’agissait en l’occurrence d’intégrer les dynamiques démographiques dans les stratégies de développement en se projetant sur une période de 20 ans, plus exactement jusqu’en 2035. Une telle action permet ainsi d’avoir une vision globale de l’évolution future de la région Vakinankaratra et d’identifier les forces qui permettraient d’influencer son développement. Les cinq jours d’atelier ont ainsi fait ressortir différents scénarii se rapportant à la vision fixée  de 2035. Des résultats qui se veulent un outil de prise de décision pour le développement de la région.</p>\n<p style=\"text-align: justify;\">La région Vakinankaratra est donc honorée d’avoir été choisie par l’AFD qui a déjà initié une recherche expérimentale sur plusieurs territoires ruraux d’Afrique du même genre, en promouvant une démarche participative basée sur l’implication des acteurs du territoire concerné. 
Mandrindra Andrianjanaka, dans son discours de clôture de l’atelier a souligné que les résolutions prises allaient être réellement prises en considération.</p>\n<p style=\"text-align: right;\"><strong>Jeannot Ratsimbazafy</strong></p>\n\n<section id=\"text-5\" class=\"widget widget_text\"/><!-- #comments -->		</div>\r\n<br>[ad_2]\r\n<br><a href=\"http://www.newsmada.com/2015/08/26/antsirabe-a-lheure-de-la-prospective-territoriale-participative/\">Source link </a>', 'Antsirabe : à l’heure de la prospective territoriale participative (NewsMada)', '', 'publish', 'open', 'open', '', 'antsirabe-a-lheure-de-la-prospective-territoriale-participative-newsmada', '', '', '2015-08-26 09:19:57', '2015-08-26 09:19:57', '', 0, 'http://example.com/antsirabe-a-lheure-de-la-prospective-territoriale-participative-newsmada/', 0, 'post', '', 0)

In this notebook, you will pre-process this data as part of a data mining pipeline. Your task is complete when the required raw texts have been extracted and classified as either French or another language.

- **Preprocess** 
    - You will extract the news content from author 1 as clean text (without HTML tags).
    - You will create a new CSV file, raw_data.csv, that contains everything collected in the previous step plus the post_date_gmt and the ```Source``` link, e.g. http://www.newsmada.com/2015/08/26/antsirabe-a-lheure-de-la-prospective-territoriale-participative in example 2. You will set missing values to None. You will also add a new column, ```Source domain```, holding the root domain name of each source link.

- **Models**
    - You will write a function process() that accepts a string (text content) as input and returns the probability of the content being in French.
    - You will run process() on each clean news content from raw_data.csv.
    - You will create a new CSV file, data.csv, a copy of raw_data.csv with a new column value that specifies the language used: 1 when French is used and 0 otherwise.

- **Prediction**: 
    It is essential to note that news content containing only a few French words (names, etc.) should not be considered French news content.
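
The pre-processing steps listed above can be sketched with the standard library alone. This is a minimal sketch, not the required solution: the helper names `clean_html`, `extract_source_link` and `root_domain` are hypothetical, the content is assumed to have its SQL escapes (`\"`, `\r\n`) already decoded, and the root domain is approximated naively as the last two labels of the host name, which is wrong for suffixes such as `co.uk`.

```python
import re
from html import unescape
from urllib.parse import urlparse

# Assumption: the source is always the anchor whose text is "Source link".
SOURCE_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>\s*Source link\s*</a>', re.I)
TAG_RE = re.compile(r"<[^>]+>")          # any HTML tag or comment without '>' inside
AD_MARKER_RE = re.compile(r"\[ad_\d+\]")  # the [ad_1] / [ad_2] placeholders

def extract_source_link(post_content):
    """Return the 'Source link' URL without a trailing slash, or None if absent."""
    match = SOURCE_RE.search(post_content)
    return match.group(1).rstrip("/") if match else None

def clean_html(post_content):
    """Drop the source anchor, tags and ad markers, then collapse whitespace."""
    text = SOURCE_RE.sub(" ", post_content)
    text = TAG_RE.sub(" ", text)
    text = AD_MARKER_RE.sub(" ", text)
    text = unescape(text)  # turn entities such as &amp; back into characters
    return re.sub(r"\s+", " ", text).strip()

def root_domain(url):
    """Naive root domain: the last two labels of the host name, or None."""
    if not url:
        return None
    host = urlparse(url).hostname
    if not host:
        return None
    return ".".join(host.split(".")[-2:])

sample = (' [ad_1]\r\n<br><div id=""><p style="text-align: justify;">'
          'Bonjour le monde.</p></div>\r\n<br>[ad_2]\r\n<br>'
          '<a href="http://www.newsmada.com/2015/08/26/some-post/">Source link </a>')
link = extract_source_link(sample)
print(clean_html(sample), link, root_domain(link))
```

Returning None for a missing link or domain matches the instruction to set missing values to None.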



In [None]:
# import module
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import *
from sklearn.metrics import *

from io import StringIO
import re


In [None]:
# TODO 1: Loading the data
# There are three ways to read the SQL file: the built-in open(), pymysql, or pandas
sql_file = "/kaggle/input/news-with-french/data_raw.sql"
def read_sql():
    # Read the dump line by line with the built-in open()
    with open(sql_file, "r", encoding="utf8") as sql:
        sqltxt = sql.readlines()
    print(sqltxt[0])  # hard to split into records
# read_sql()
# def read_sql_script_all(sql_file_path, quotechar="'") -> (str, dict):

def read_sql_script_all(sql_file_path, quotechar="'"):
    # Raw string so that \w and \( reach the regex engine unchanged
    insert_check=re.compile(r"insert +into +`?(\w+?)`?\(", re.I|re.A)
    with open(sql_file_path, encoding="utf-8") as f:
        sql_txt=f.read()
    print(len(sql_txt))
    end_pos = -1
    df_dict = {}
    while True:
        match_obj = insert_check.search(sql_txt, end_pos+1)
        print(match_obj)
        if not match_obj: 
            break
        table_name = match_obj.group(1)
        start_pos = match_obj.span()[1]+1
        end_pos = sql_txt.find(";", start_pos)
        tmp = re.sub(r"\)( values |,)\(","\n",sql_txt[start_pos:end_pos])
        tmp =re.sub(r"[`()]","",tmp)
        df=pd.read_csv(StringIO(tmp),quotechar=quotechar)
        dfs=df_dict.setdefault(table_name,[])
        dfs.append(df)
    for table_name, dfs in df_dict.items():
        df_dict[table_name]=pd.concat(dfs)
    return df_dict

read_sql_script_all(sql_file)

In [None]:
# TODO 2: Generating the raw_data.csv

# TODO 3: Your model for language detection or identification
- Some ideas about natural language processing [link](https://www.zhihu.com/question/356132676)

In [None]:
French_text = "/kaggle/input/news-with-french/Paris et Londres en 1793.txt"
English_text = "/kaggle/input/news-with-french/A Tale of Two Cities.txt"
raw_french  = open(French_text,encoding='utf8').read()
raw_english  = open(English_text,encoding='utf8').read()
raw_english[:500]

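To solve the sentence-splitting problem, I first tried Python's NLTK toolkit and found that its default algorithm is not perfect, or rather cannot suit every purpose, so some post-processing is still needed. For example, NLTK, textstat and similar tools by default treat a line that ends with a newline rather than punctuation as part of the same sentence, as happens with the table of contents at the front of a book, and this produces some unreasonably long sentences.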
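
One possible post-processing step for this problem is sketched below. It is an assumption-laden heuristic, not what NLTK does: the hypothetical helper `split_sentences` treats a blank line as a hard sentence boundary (so headings and table-of-contents entries stay separate), joins hard-wrapped lines inside a paragraph, and then splits after `.`, `!` or `?`.

```python
import re

def split_sentences(text):
    """Split on blank lines first, then after ., ! or ? followed by whitespace."""
    sentences = []
    # A blank line separates blocks such as headings, contents entries, paragraphs.
    for block in re.split(r"\n\s*\n", text):
        block = " ".join(block.split())  # join wrapped lines within one block
        for piece in re.split(r"(?<=[.!?])\s+", block):
            if piece:
                sentences.append(piece)
    return sentences

demo = ("CONTENTS\n\nChapter I\n\nIt was the best of times,\n"
        "it was the worst of times. It was the age\nof wisdom! Was it?")
for s in split_sentences(demo):
    print(repr(s))
```

Abbreviations such as "M." or "Dr." would still be split incorrectly by this rule.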

In [None]:
# Word segmentation: https://zhuanlan.zhihu.com/p/242247311
def sentence_split_nltk(strs):
#     import nltk
#     nltk.download()

    from nltk.tokenize import sent_tokenize
    sent_tokenize_list = sent_tokenize(strs)
    return sent_tokenize_list

def sentence_split(strs):
    strs = strs.replace("!",".").replace("?",".").replace("\t"," ").replace("\n"," ")
    return strs.split(".")
sen_en = sentence_split(raw_english) #https://www.dtmao.cc/news_show_1850208.shtml
sen_french = sentence_split(raw_french)
x = sen_en + sen_french
y = ['en']*len(sen_en) + ['fe']*len(sen_french)

- [How to use CountVectorizer](https://blog.csdn.net/m0_37788308/article/details/80933915) (including Chinese text)

In [None]:
def test_Count():
    a ="自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学"
    b = "因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，所以它与语言学的研究有着密切的联系，但又有重要的区别。自然语言处理并不是一般地研究自然语言，而在于研制能有效地实现自然语言通信的计算机系统，特别是其中的软件系统。"
    c ="因而它是计算机科学的一部分。自然语言处理（NLP）是计算机科学，人工智能，语言学关注计算机和人类（自然）语言之间的相互作用的领域。"
    import jieba
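    # Pre-tokenize with jieba so that CountVectorizer can split on whitespace
    all_list= ['  '.join(jieba.cut(s,cut_all = False)) for s in [a,b,c]]
    # print((all_list)[0])
    count_vec=CountVectorizer()
    count_vec.fit_transform(all_list).toarray()  # fit on the tokenized text, not the raw strings
    print('\nvocabulary list:\n\n',count_vec.get_feature_names_out())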
    print( '\nvocabulary dic :\n\n',count_vec.vocabulary_)
    print ('vocabulary:\n\n')
    for key,value in count_vec.vocabulary_.items():
        print(key,value)

### [The method of building the model](https://zhuanlan.zhihu.com/p/27447133)(traditional)

In [None]:
def remove_noise(document):
    # Remove URLs, @mentions and #hashtags (raw strings keep the regex escapes intact)
    noise_pattern = re.compile("|".join([r"http\S+", r"\@\w+", r"\#\w+"]))
    clean_text = re.sub(noise_pattern, "", document)
    return clean_text.strip()
remove_noise("Trump images are now more popular than cat gifs. @trump #trends http://www.trumptrends.html")

Define the model

In [None]:
bigram_count = CountVectorizer(ngram_range=(2,2),analyzer='char_wb',  # character bigrams inside word boundaries
    max_features=1000,  # keep the most common 1000 ngrams
    preprocessor=remove_noise
)
bigram_count
# https://www.dtmao.cc/news_show_1850208.shtml
pipline = Pipeline([("vectorizer",bigram_count),("model",MultinomialNB())])
pipline
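
To make the feature extraction more concrete, the sketch below counts character bigrams inside space-padded words using only the standard library. It only approximates what `CountVectorizer` with `analyzer='char_wb'` and `ngram_range=(2,2)` computes and is not a substitute for it; the helper name `char_bigrams` is hypothetical, and the lowercasing is an assumption of the sketch.

```python
from collections import Counter

def char_bigrams(text):
    """Count 2-character n-grams within each word, padding words with spaces."""
    counts = Counter()
    for word in text.lower().split():
        padded = f" {word} "  # the padding marks word starts and ends
        for i in range(len(padded) - 1):
            counts[padded[i:i + 2]] += 1
    return counts

print(char_bigrams("le chat"))
```

Bigrams such as `" l"` or `"t "` encode word starts and endings, which tend to differ between languages and give a Naive Bayes model something to separate them on.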

Train and test the model

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1024)
pipline.fit(x_train,y_train)
y_prd = pipline.predict(x_test)
print(confusion_matrix(y_test,y_prd))
print(classification_report(y_test,y_prd))
def predict_sentence(sentence:str):
    y_prd = pipline.predict([sentence])
    return y_prd[0]
def predict_article(article:str):
    y_prd = list(pipline.predict(sentence_split(article)))
#     print(y_prd)
    c = max(y_prd,key=y_prd.count)
    return c=='fe'
predict_article("J'utilise souvent SQL dans mon travail, et il existe de nombreuses nuances et limitations ennuyeuses, mais en dernière analyse, c'est la pierre angulaire de l'industrie des données. Par conséquent, pour chaque travailleur dans le domaine des données, SQL est indispensable. La maîtrise de SQL est d'une grande importance.")
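
The task asks for a probability and warns that a few French words should not make an article French, whereas a majority vote only yields a boolean. A minimal sketch of one alternative, assuming a per-sentence classifier is passed in as a callable that returns the label `'fe'` for French as in this notebook: the names `french_probability` and `is_french`, the `min_chars` filter and the 0.5 threshold are all illustrative assumptions.

```python
def french_probability(sentences, predict_sentence, min_chars=20):
    """Fraction of sufficiently long sentences that the classifier labels French."""
    # Very short fragments (names, single words) are ignored as unreliable.
    votes = [predict_sentence(s) for s in sentences if len(s.strip()) >= min_chars]
    if not votes:
        return 0.0
    return sum(1 for v in votes if v == "fe") / len(votes)

def is_french(sentences, predict_sentence, threshold=0.5):
    """Call an article French only when enough of its sentences are French."""
    return french_probability(sentences, predict_sentence) >= threshold

# Stub classifier for demonstration only; a trained model would replace it.
def stub_predict(sentence):
    return "fe" if sentence.startswith("Le") else "en"

demo_sentences = ["Le gouvernement a annoncé une réforme.",
                  "The minister spoke to reporters today.",
                  "Oui"]
print(french_probability(demo_sentences, stub_predict))
```

Raising the threshold makes the decision stricter for mostly non-French articles that quote a few French sentences.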

In [None]:
# TODO 4: Generating the data.csv

In [None]:
df = pd.read_csv("/kaggle/input/news-with-french/raw_data.csv")
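# Two ways to drop empty rows: np.isnan and df.dropna()
# df = df[~np.isnan(df['clean_text'])]  # pandas quirk: NaN counts as a float
# df['clean_text'][69].map(predict_article)
# The float error was traced to the offending item, df['clean_text'][69]: NaN is a float,
# yet np.isnan cannot filter it out of a text column; only the pandas methods detect it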
df = df[~pd.isnull(df['clean_text'])] 
df['Class'] = df['clean_text'].map(predict_article)
df.head()

In [None]:
df.to_csv("./data.csv",index=False)

Answer the following questions:

- What is the total number of French news articles in the generated data.csv?
- How many French articles a month were published on average?
- Make a visualization of the number of French articles published daily based on data.csv
- Which period (lasting 15 days) had the most French articles published?
- Make a visualization that compares the daily article publications in French and other language(s) for the whole data in data.csv
- How many unique source domains are there in total?
- How many unique source domains write content in French only?
- Visualize the number of French articles published in the different source domains per year.

- What is the total number of French news articles in the generated data.csv?

In [None]:
df = pd.read_csv("./data.csv")
df_french = df[df['Class']==True].copy()  # copy so later column assignments do not act on a slice of df
len(df_french)

- How many French articles a month were published on average?

In [None]:
# 1.parse datetime
import datetime as dt
from datetime import datetime
# df_french['post_data_gmt'] = df_french['post_data_gmt'].map(lambda x:x.strftime('%Y-%m-%d %H:%M:%S'))
def parse_ymd(s): # faster than datetime's built-in parsing
    year_s, mon_s, day_s, hour_s, min_s, second_s= s.replace(":","-").replace(" ","-").split('-')
    return datetime(int(year_s), int(mon_s), int(day_s),int(hour_s),int(min_s),int(second_s))
df_french['post_data_gmt'] = df_french['post_data_gmt'].map(parse_ymd)
df_french = df_french.set_index('post_data_gmt')
# 2.set a col to mark month
df_french['month'] = df_french.index.map(lambda x:x.month)
# 3.get series and plot
month_count = df_french.groupby('month').agg('count')['Class']
month_count.plot(kind='bar')
print(f"{int(month_count.mean())} French articles were published per month on average")
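
Grouping by the month number alone merges the same month of different years. If the data spans more than one year, an average over calendar months may be closer to what the question asks. A minimal sketch using only the standard library, where `monthly_average` is a hypothetical helper and months without any article are not counted (an assumption of the sketch):

```python
from collections import Counter
from datetime import datetime

def monthly_average(timestamps):
    """Average number of items per (year, month) pair that has at least one item."""
    per_month = Counter((ts.year, ts.month) for ts in timestamps)
    if not per_month:
        return 0.0
    return sum(per_month.values()) / len(per_month)

stamps = [datetime(2015, 8, 1), datetime(2015, 8, 15), datetime(2016, 8, 3)]
print(monthly_average(stamps))
```

In the example, August 2015 and August 2016 count as two separate months, whereas grouping by month number would put all three articles under a single "8".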

- Make a visualization of the number of French articles published daily based on data.csv

In [None]:
# set a col to mark day
df_french['day'] = df_french.index.map(lambda x:x.date())  # call date() to get the calendar day
# get series and plot
day_count = df_french.groupby('day').agg('count')['Class']
day_count.plot()
print(f"{int(day_count.mean())} French articles were published per day on average")

- Which period (lasting 15 days) had the most French articles publication?

In [None]:
# between_time() only accepts times of day, so it cannot select a range of dates
# df_french.between_time("06:00", "22:00")
sorted_index = df_french.index.sort_values()  # searchsorted() assumes a sorted index
max_pub = 0
max_sday = sorted_index[0]
for i in sorted_index:
    start = sorted_index.searchsorted(i)
    end = sorted_index.searchsorted(i+dt.timedelta(days=15))
    days = end - start  # number of articles in [i, i + 15 days)
    if max_pub < days: 
        max_pub = days
        max_sday = i
print(f"between {max_sday} and {max_sday+dt.timedelta(days=15)}, {max_pub} articles were published")
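
The same window search can be written without pandas, which makes the logic easier to check. This is a sketch under the assumption that a window is the half-open interval from a start time up to, but not including, 15 days later; `busiest_window` is a hypothetical name.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def busiest_window(timestamps, days=15):
    """Return (start, count) for the window of `days` days holding the most items."""
    ordered = sorted(timestamps)  # binary search needs sorted input
    best_start, best_count = None, 0
    for i, start in enumerate(ordered):
        # First position whose timestamp is at or beyond the end of the window.
        end = bisect_left(ordered, start + timedelta(days=days))
        if end - i > best_count:
            best_start, best_count = start, end - i
    return best_start, best_count

stamps = [datetime(2015, 1, 1), datetime(2015, 1, 5), datetime(2015, 1, 10),
          datetime(2015, 2, 1), datetime(2015, 2, 2)]
print(busiest_window(stamps))
```

Only windows that start at an article's timestamp are tried. That is sufficient, because any window can be shifted forward to start at its first article without losing an article.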

- Make a visualization that compares the daily article publications in French and Other languages (s) for the whole data in data.csv

In [None]:
df_other = df[df['Class']==False].copy()  # copy so column assignments do not act on a slice of df
df_other['post_data_gmt'] = df_other['post_data_gmt'].map(parse_ymd)
df_other = df_other.set_index('post_data_gmt')
df_other['day'] = df_other.index.map(lambda x:x.date())  # call date() to get the calendar day
# get series and plot
import seaborn as sns; sns.set()
%matplotlib inline
day_count_other = df_other.groupby('day').agg('count')['Class']
# day_count holds the French counts computed in the earlier cell
day_count.name = 'French'
day_count_other.name = 'Other'

In [None]:
dd=[day_count_other,day_count]
ax = sns.lineplot(data=dd)
plt.show()
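
For the two questions on source domains, one possible approach is to collect, per domain, the set of language flags seen. The sketch below is illustrative only: `domain_language_summary` is a hypothetical helper that takes plain `(source_domain, is_french)` pairs, and rows whose domain is None are skipped by assumption.

```python
from collections import defaultdict

def domain_language_summary(rows):
    """Return (number of unique domains, set of domains with French articles only)."""
    flags = defaultdict(set)
    for domain, is_french in rows:
        if domain is not None:  # skip rows without a source domain
            flags[domain].add(bool(is_french))
    french_only = {d for d, seen in flags.items() if seen == {True}}
    return len(flags), french_only

rows = [("newsmada.com", True), ("newsmada.com", True),
        ("example.org", True), ("example.org", False),
        ("example.net", False), (None, True)]
print(domain_language_summary(rows))
```

With the notebook's data, the pairs could come from zipping the source-domain column with the `Class` column, provided missing domains are represented as None rather than NaN.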