# Introduction

## CRISP-DM Method
### [Business Understanding](#business_understanding)
### [Data Understanding](#data_understanding)
   #### [Step 1: Load data into a dataframe](#load_data_df)
   #### [Step 2: Display the dimensions of the file](#display_data)
   #### [Step 3: What type of variables are in the table](#variable_types)
### Data Prep
   #### [Adding new columns from datetime](#adding_yr_mnth)
   #### [Milestone 2](#milestone_2)
   #### [Test data preparation](#test_data)
### Modeling
### Evaluation
### Deployment

<a id='business_understanding'></a>
## 1. Business Understanding

- This dataset contains around 200k news headlines published by HuffPost between 2012 and 2018. A model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.
- Can news articles be categorized based on their headlines and short descriptions?
- Do news articles from different categories have different writing styles?
- A classifier trained on this dataset could be applied to free text to identify the type of language being used.

<a id='data_understanding'></a>
## 2. Data Understanding

In [None]:
# Grouping / Classification of documents based on Text Analysis - Part 1 
## Graphics Analysis

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [None]:
#check versions of packages
print('Python version:')  
!python --version
print('pandas version:', pd.__version__)
print('numpy version:', np.__version__)
# print('scikit-learn version:', sklearn.__version__)
# print('NLTK version:', nltk.__version__)

In [None]:
# Setting the parameters for the pandas dataframe

output_width = 1000
#output_width = 80 #//*** Normal Output width
pd.set_option("display.width", output_width)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

<a id='load_data_df'></a>
### Step 1: Load data into a dataframe

In [None]:
# This dataset contains the topics already marked and so I would like to use this 
# as training set for my model.
# I have a separate test data set for testing.

filename = "~/Documents/mygithub/bu_dsc/data/external/News_Category_Dataset_v2.json"
test_file = "~/Documents/mygithub/bu_dsc/data/external/global-issues.csv"

df_all = pd.read_json(filename, lines = True)
#display the first few rows of data
df_all.head()

<a id='display_data'></a>
### Step 2: Display the dimensions of the file

In [None]:
# Check the shape so you have a good idea of the amount of data you are working with.
print("The dimension of the table is: ", df_all.shape)
print("Checking to see if there are any missing data: ")
df_all.info()
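
`df_all.info()` reports non-null counts, but a text column can also hold empty strings, which count as non-null. A minimal sketch of a check that reports both, shown on a small made-up frame. `missing_summary` and the sample rows are illustrative assumptions, not part of the dataset:

```python
import pandas as pd

def missing_summary(df):
    """Count nulls and blank strings per column (hypothetical helper)."""
    nulls = df.isna().sum()
    # Treat empty or whitespace-only strings as blank; non-string values are ignored.
    blanks = df.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "").sum()
    )
    return pd.DataFrame({"nulls": nulls, "blanks": blanks})

# Made-up rows for illustration only
demo = pd.DataFrame({
    "headline": ["A headline", "", "Another one"],
    "short_description": ["Some text", None, "   "],
})
summary = missing_summary(demo)
print(summary)
```

On the real data the call would be `missing_summary(df_all)`.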

<a id='variable_types'></a>
### Step 3: What type of variables are in the table

In [None]:
# Look at summary information about your data (total, mean, min, max, freq, unique, etc.) 
# Does this present any more questions for you? 
# Does it lead you to a conclusion yet?

print("Describe Data") 
print(df_all.describe()) 
print("Summarized Data") 
print(df_all.describe(include='O'))

<a id='adding_yr_mnth'></a>
### Breaking down the datetime column into year month for visualization  

In [None]:
# Split the date column into separate year and month columns for easier analysis.
# The date column is already in datetime format, so the .dt accessor can be used.

df_all['year'] = df_all['date'].dt.year
df_all['month'] = df_all['date'].dt.month

df_all.head()

In [None]:
# df_all.columns
for col in df_all.columns[[0,6,7]]:
    print(col,len(df_all[col].unique()),df_all[col].unique())

In [None]:
# Display the dimensions of the dataframe post addition of the new columns
# We now see that it has 2 more columns added towards the end as year and month of type integer

print("The dimension of the table is: ", df_all.shape)
print("Checking to see if there are any missing data: ")
df_all.info()

In [None]:
# Bar chart of article counts per category.
# The counts are computed into a separate series (df_catg) from the training set.
# The top 3 categories are Politics, Wellness and Entertainment, which are by far larger than the rest.
# %matplotlib inline 

# The X axis shows the article categories
# The Y axis shows the corresponding counts 
width = 0.75 # the width of the bars

df_catg = df_all.groupby(['category'])['category'].count()

ax = df_catg.plot(kind='bar', figsize=(20,10), color="indigo",width = width)

plt.xticks(rotation=90)
ax.set_title('Category Count', fontsize=25) 
ax.set_xlabel('Category', fontsize=20)
ax.set_ylabel('Counts', fontsize=20)
ax.tick_params(axis='both', labelsize=15)

plt.show()


In [None]:
# Now subsetting the actual dataset
# Subsetting the dataset by category and year

subset_bycatgyear = df_all[['category','year']]
# This is an array of the years in the dataset
years = subset_bycatgyear['year'].unique()
years
# This shows an array of categories that is present in our dataset
# categories = subset_bycatgyear['category'].unique()
# categories


In [None]:
# Creating a grouping subset to be used in my analysis later

year_grp = subset_bycatgyear.groupby(['year'])

In [None]:
# displaying the data
# print('showing all the categories for a particular year:')
# year_grp.get_group(2012)
print('showing all the categories for a particular year:')
year_grp['category'].value_counts().loc[2018]

In [None]:
# Creating a dataframe out of the grouping set

df = pd.DataFrame(year_grp['category'].value_counts())
df.unstack(level=1)

In [None]:
# getting the top 3 categories for each year
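
One way this step could be filled in, as a sketch: count the categories within each year, sort by count, and keep the first three rows per year. The rows below are made up. With the notebook's data, `subset_bycatgyear` would take the place of `demo`.

```python
import pandas as pd

# Made-up rows for illustration; the real input has the same two columns
rows = ([(2017, "POLITICS")] * 4 + [(2017, "WELLNESS")] * 3 + [(2017, "TRAVEL")] * 2
        + [(2017, "SPORTS")] + [(2018, "POLITICS")] * 2 + [(2018, "COMEDY")])
demo = pd.DataFrame(rows, columns=["year", "category"])

# Count articles per (year, category) and flatten to ordinary columns
counts = demo.groupby(["year", "category"]).size().reset_index(name="n")

# Sort by count within each year, then keep the first three rows per year
top3 = (counts.sort_values(["year", "n"], ascending=[True, False])
              .groupby("year")
              .head(3))
print(top3)
```

A year with fewer than three categories simply returns all of its rows.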



In [None]:
# Grouped bar chart: article counts per category, grouped by year.
# As in the previous chart, Politics, Wellness and Entertainment are by far the largest categories.
# %matplotlib inline 

# The X axis shows the year
# The Y axis shows the article count for each category in that year
width = 0.50 # the width of the bars
plt.rcParams['figure.figsize'] = (20, 10)

# Get the data: one row per year, one column per category
# df = subset_bycatgyear.groupby(['year','category'])['category'].count()
df1 = df.unstack(level=1)

# Make the bar plot
ax = df1.plot(kind='bar', width = width)

plt.xticks(rotation=0)
ax.set_title('Category Count by Year', fontsize=25) 
ax.set_xlabel('Year', fontsize=20)
ax.set_ylabel('Counts', fontsize=20)
# ax.tick_params(axis='both', labelsize=15)

# plt.show()

<a id='milestone_2'></a>
## Milestone 2

In [None]:
#display the first few rows of data
df_all.head()


In [None]:
# Display the dimensions of the dataframe post addition of the new columns
# We now see that it has 2 more columns added towards the end as year and month of type integer

print("The dimension of the table is: ", df_all.shape)
print("Checking to see if there are any missing data: ")
df_all.info()

### Description of the data set
The training data set contains 200,853 rows, each representing a Huffington Post article on a particular topic.

It has 8 columns: 7 features plus the "category" column, which is the target vector. Because the target labels are available, this is treated as a supervised learning problem.

The model will use the headline and short description features to categorize the articles.

The year and month features were derived from the date column for visualization only. They add no information for the model, so I am dropping them as part of the feature selection process.

The link feature holds the web link to the news item, so it would not add any information for the categorization either.

Because this project is about text analysis, for the time being I am not considering any features from the dataset other than the text values.


In [None]:
# In the next step I create a new training dataframe df_train and the target vector y.
# .copy() makes df_train independent of df_all, so later edits do not touch df_all.
df_train = df_all[['headline','short_description']].copy()
df_train.info() 
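
Since the model is meant to use both text columns, one option is to join them into a single field before vectorizing. This is an assumption, not something the notebook does yet. `combine_text` is a hypothetical helper and the rows are made up:

```python
import pandas as pd

def combine_text(df, columns=("headline", "short_description")):
    """Join the given text columns with a space; missing values become ''."""
    parts = [df[c].fillna("").astype(str).str.strip() for c in columns]
    combined = parts[0]
    for part in parts[1:]:
        combined = combined.str.cat(part, sep=" ")
    # Drop the trailing space left behind when a later column was empty
    return combined.str.strip()

# Made-up rows for illustration only
demo = pd.DataFrame({
    "headline": ["Storm Hits Coast", "Team Wins Final"],
    "short_description": ["Thousands lose power.", None],
})
demo["text"] = combine_text(demo)
print(demo["text"].tolist())
```

Keeping the two columns separate remains a valid alternative if the headline and the description should be weighted differently.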

In [None]:
# The target vector 
y = df_all['category']
print ('Length of the target vector: ' , len(y))
print ('Size of the vector: ',y.size)

In [None]:
## In the next section I am working on cleaning the dataset and word tokenization.

In [None]:
# Convert text to lowercase and remove punctuation
# Define a function to clean the text

# Import the regular expressions library
import re

def clean_text(text):
    """
    Remove punctuation and special characters, and make the text lowercase.
    Args: text (str)
    Output: cleaned text (str)
    """
    text = text.lower()  # make text lowercase
    text = re.sub(r'\d|\W+|_', ' ', text)  # replace digits, non-word characters and underscores with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # replace any remaining non-alphabetic character with a space
    return text
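
`clean_text` replaces each unwanted character with a space, so runs of spaces can remain in its output. They disappear later when the tokens are joined back together. If a compact string is wanted directly, a variant along these lines could be used. `clean_text_compact` is a hypothetical name, not part of the notebook:

```python
import re

def clean_text_compact(text):
    """Lowercase, keep only a-z, and collapse separators to single spaces."""
    text = text.lower()
    # Replace every run of non-letters (digits, punctuation, whitespace) with one space
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()

example = "There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV"
print(clean_text_compact(example))
# there were mass shootings in texas last week but only on tv
```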

In [None]:
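# Import the word tokenizer from NLTK
# Note: word_tokenize needs the NLTK 'punkt' tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases ask for 'punkt_tab' instead)
from nltk.tokenize import word_tokenize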

def tokenize_text(text):
    return word_tokenize(text)

In [None]:
# Testing the functions
input_txt = "There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV"
output_txt = clean_text(input_txt)
output_txt

In [None]:
# Create a new dataframe with a column for each pre-processing step
# Take a random sample of the dataframe to cut down on processing time
# Number of headlines to keep
num_comments = 5
df_sample_headline = df_train.sample(n = num_comments).reset_index(drop = True )
df_sample_headline.head()

In [None]:
## Applying the cleaning for the headline part

#apply text cleaning function
df_sample_headline['hdln_clean'] = df_sample_headline['headline'].apply(clean_text)
#apply tokenizing
df_sample_headline['hdln_tokenized'] = df_sample_headline['hdln_clean'].apply(tokenize_text)
#apply PorterStemmer function
# df_sample['txt_stemmed'] = df_sample['txt_tokenized'].apply(stem_text)
#put the text back together (untokenize)
df_sample_headline['hdln_final'] = df_sample_headline['hdln_tokenized'].apply(lambda text: ' '.join(text))
#view the pre-processed text
print('Show the dimension of the new dataframe: ', df_sample_headline.shape)
df_sample_headline.head()
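
The commented-out stemming step refers to a `stem_text` function that is not defined in this notebook. A possible definition, assuming NLTK's `PorterStemmer` is the intended stemmer and that the function receives a list of tokens:

```python
from nltk.stem import PorterStemmer

# One stemmer instance reused for every call
_stemmer = PorterStemmer()

def stem_text(tokens):
    """Return the Porter stem of each token in a list of tokens."""
    return [_stemmer.stem(token) for token in tokens]

print(stem_text(["running", "shootings"]))
```

With this in place, the stemming line could be enabled as `df_sample_headline['hdln_tokenized'].apply(stem_text)`, assigned to a new column.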

<a id='test_data'></a>

### Loading the test dataset

Although this dataset, which I downloaded from Kaggle, already includes some classification data, I will avoid using those labels and instead use the dataset to predict the cluster.


In [None]:
df_test = pd.read_csv(test_file)
df_test.info()

In [None]:
df_test.head()
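
The test data should go through the same cleaning as the training data before any prediction. A minimal sketch of how that could look. The column name `text` is a placeholder, since the actual column names of `df_test` are not shown here, and `basic_clean` and `preprocess_column` are hypothetical helpers:

```python
import re
import pandas as pd

def basic_clean(text):
    """Lowercase and keep letters only, mirroring the training-side cleaning idea."""
    return re.sub(r"[^a-z]+", " ", str(text).lower()).strip()

def preprocess_column(df, column, new_column):
    """Return a copy of df with a cleaned version of `column` added."""
    out = df.copy()
    out[new_column] = out[column].fillna("").apply(basic_clean)
    return out

# 'text' is a placeholder; the real column name must be read from df_test.columns
demo_test = pd.DataFrame({"text": ["Climate Talks Resume in 2021!", None]})
demo_clean = preprocess_column(demo_test, "text", "text_clean")
print(demo_clean["text_clean"].tolist())
```

Applying one shared function to both sets keeps the vocabulary seen at prediction time consistent with the one used in training.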