# Clean data

This notebook imports the data CSV files and does the following:
1. Extracts only the article text and label columns
2. Removes duplicate articles
3. Builds an index of non-English articles and removes them
4. Replaces all special characters in the articles with a space
5. Saves the clean data as df_article_text.csv in the data folder

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from langdetect import detect

In [2]:
# Install langdetect. This only needs to be run once, before importing it
!pip install langdetect



In [3]:
# Read in the csv files
df_pos = pd.read_csv('data/positive_articles.csv', sep=',')
df_neg = pd.read_csv('data/negative_articles.csv', sep=',')

# Merge positive and negative data into one data frame
df = pd.concat([df_pos, df_neg], ignore_index=True)

# Subsetting label and article_text data
columns = ['label', 'article_text']
df_article_text = df[columns]

In [4]:
df_article_text.shape

(12859, 2)

Problems with the data:

1. Some articles are not in English
2. '\n' and 'â' characters are present
3. Duplicate article_text: different URLs but the same article
4. The name of the article's author is present

In [5]:
# Work on an explicit copy so the in-place operations and column assignments
# below do not raise SettingWithCopyWarning
df_article_text = df_article_text.copy()

# Remove duplicate rows, keeping the first occurrence
df_article_text.drop_duplicates(keep='first', inplace=True)

# Collect the index labels of non-English rows
non_en_index = []

for index, row in df_article_text.iterrows():
    # Explicitly convert article_text to a string because a few rows are read in as non-strings
    lang = detect(str(row['article_text']))
    if lang != 'en':
        non_en_index.append(index)

# Remove the non-English articles
df_article_text.drop(non_en_index, inplace=True)

# Replace newlines and special characters with a space.
# Series.str.replace with regex=True is used because Python's built-in
# str.replace treats the pattern as literal text and would never match it.
df_article_text['article_text'] = df_article_text['article_text'].str.replace('\n', ' ', regex=False)
df_article_text['article_text'] = df_article_text['article_text'].str.replace(r'[^a-zA-Z\d\s:]', ' ', regex=True)
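The special-character step depends on the pattern `[^a-zA-Z\d\s:]` being treated as a regular expression. As a minimal sketch using only the standard library `re` module (the string `text` is a made-up example), the following contrasts a literal replacement with a regex substitution:

```python
import re

# Hypothetical article text with a newline, punctuation and a stray 'â'
text = 'Breaking: prices rose 5%!\nâ full story inside'

# Built-in str.replace searches for the pattern as literal text,
# so the string comes back unchanged
literal = text.replace(r'[^a-zA-Z\d\s:]', ' ')

# re.sub interprets the pattern as a regular expression and replaces every
# character that is not an ASCII letter, a digit, whitespace or a colon
cleaned = re.sub(r'[^a-zA-Z\d\s:]', ' ', text.replace('\n', ' '))

print(literal == text)  # True
print(cleaned)
```

On a pandas column, the equivalent is `Series.str.replace(pattern, ' ', regex=True)`. Note that this pattern also removes accented letters and all punctuation other than the colon, which may or may not be what later steps expect.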


In [6]:
# Write the clean data into a new csv
df_article_text.to_csv('data/df_article_text.csv', sep=',', index=False)
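A quick round-trip check can confirm that the saved file reads back as expected. This is a sketch under assumptions: the `cleaned` DataFrame is toy data, and a temporary directory is used in place of the real data folder.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned data (hypothetical rows)
cleaned = pd.DataFrame({
    'label': [1, 0],
    'article_text': ['First article text', 'Second article text'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write and re-read the file the same way the notebook writes it
    path = os.path.join(tmp_dir, 'df_article_text.csv')
    cleaned.to_csv(path, sep=',', index=False)
    reloaded = pd.read_csv(path)

# Basic sanity checks on the reloaded data
same_shape = reloaded.shape == cleaned.shape
no_missing = not reloaded['article_text'].isna().any()
no_newlines = not reloaded['article_text'].str.contains('\n', regex=False).any()

print(same_shape, no_missing, no_newlines)
```

The missing-value check is worth keeping on real data: an article that is reduced to an empty string during cleaning is written as an empty field and read back as NaN by default.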