# What to Expect
This notebook demonstrates several machine learning and deep learning techniques for text classification. I start by exploring a few candidate datasets and then pick one for a deep dive. The first candidate is a dataset of scientific paper reviews. If its data quality is good enough, I will run sentiment analysis on it with several algorithms; otherwise, I will move on to another dataset.

# Initial Data Exploration
The goal of this step is to skim several text datasets, examine their data quality, and select the one I will use to demonstrate the machine learning algorithms in action. The main steps are:

*   Load the dataset into local memory
*   Parse the file into the desired format, such as a pandas DataFrame (this may take several steps, depending on the level of granularity we need)
*   Examine the basics, such as the main features and missing values
*   If data quality is satisfactory, proceed to machine learning; otherwise, move on to exploring the next dataset
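
The checks in the list above can be sketched as a small helper. This is a minimal sketch under stated assumptions: the function name `summarize_quality` and the toy DataFrame are hypothetical and are not part of this notebook's data or code.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame, text_col: str) -> dict:
    """Return a few basic data-quality indicators for a text dataset."""
    text = frame[text_col]
    # Treat missing values and whitespace-only strings as empty text.
    is_empty = text.isna() | (text.fillna("").str.strip() == "")
    return {
        "rows": len(frame),
        "columns": list(frame.columns),
        "missing_per_column": frame.isna().sum().to_dict(),
        "empty_text_rows": int(is_empty.sum()),
    }


# Toy data for illustration only; not taken from any real dataset.
toy = pd.DataFrame(
    {
        "text": ["good paper", "", None, "needs validation"],
        "label": ["accept", "reject", "accept", None],
    }
)
report = summarize_quality(toy, "text")
print(report)
```

A dataset with many empty text rows or missing labels would be a candidate for skipping in favour of the next one.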

In [1]:
#Loading the first dataset
!wget 'https://archive.ics.uci.edu/ml/machine-learning-databases/00410/reviews.json'

--2020-07-06 18:08:41--  https://archive.ics.uci.edu/ml/machine-learning-databases/00410/reviews.json
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 593600 (580K) [application/x-httpd-php]
Saving to: ‘reviews.json’


2020-07-06 18:08:41 (3.20 MB/s) - ‘reviews.json’ saved [593600/593600]



In [52]:
#Importing python libraries
import numpy as np
import pandas as pd #to work with tabular data (here loaded from json)

#matplotlib imports are used to plot confusion matrices for the classifiers
import matplotlib as mpl 
import matplotlib.cm as cm 
import matplotlib.pyplot as plt 

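#import feature extraction methods
from sklearn.feature_extraction.text import CountVectorizer
#the sklearn.feature_extraction.stop_words module was removed in scikit-learn 0.24; the public location of the stop word list is used instead
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS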

#pre-processing of text
import string
import re

#import classifiers 
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

#import evaluation metrics and the train/test split helper
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix 
from sklearn import metrics

#import time function to track the training duration
from time import time

The JSON file is nested, so I will inspect it level by level, from the top down.
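
One way to get an overview of a nested structure before parsing it is to walk it recursively and list the keys at each level. The following is a minimal sketch, not the approach used in the cells below; the helper `outline` and the `sample` record are hypothetical, with `sample` only mirroring the layout shown in the output further down.

```python
def outline(node, depth=0, max_depth=4):
    """Return indented lines describing the keys and types of nested JSON."""
    lines = []
    if depth > max_depth:
        return lines
    indent = "  " * depth
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Inspect only the first element, assuming list items share one shape.
        lines.append(f"{indent}[0]: {type(node[0]).__name__}")
        lines.extend(outline(node[0], depth + 1, max_depth))
    return lines


# Hypothetical record mirroring the layout shown in the notebook output.
sample = {
    "paper": [
        {
            "id": 1,
            "preliminary_decision": "accept",
            "review": [{"id": 1, "lan": "es", "text": "..."}],
        }
    ]
}
structure = outline(sample)
print("\n".join(structure))
```

With the real file, the same helper could be applied to the object returned by `json.load`, assuming the file fits in memory.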

In [167]:
#Explore dataset (top most level in the nested file)
df = pd.read_json(r'reviews.json')

In [168]:
df.loc[0,'paper'] #Each paper record has 3 main keys: id, preliminary_decision and review

{'id': 1,
 'preliminary_decision': 'accept',
 'review': [{'confidence': '4',
   'evaluation': '1',
   'id': 1,
   'lan': 'es',
   'orientation': '0',
   'remarks': '',
   'text': '- El artículo aborda un problema contingente y muy relevante, e incluye tanto un diagnóstico nacional de uso de buenas prácticas como una solución (buenas prácticas concretas). - El lenguaje es adecuado.  - El artículo se siente como la concatenación de tres artículos diferentes: (1) resultados de una encuesta, (2) buenas prácticas de seguridad, (3) incorporación de buenas prácticas. - El orden de las secciones sería mejor si refleja este orden (la versión revisada es #2, #1, #3). - El artículo no tiene validación de ningún tipo, ni siquiera por evaluación de expertos.',
   'timespan': '2010-07-05'},
  {'confidence': '4',
   'evaluation': '1',
   'id': 2,
   'lan': 'es',
   'orientation': '1',
   'remarks': '',
   'text': 'El artículo presenta recomendaciones prácticas para el desarrollo de software seguro. S

In [169]:
#Extracting the main features of each research paper
paper_df = pd.json_normalize(df["paper"])


In [170]:
paper_df.head()

|   | id | preliminary_decision | review |
|---|---|---|---|
| 0 | 1 | accept | [{'confidence': '4', 'evaluation': '1', 'id': ... |
| 1 | 2 | accept | [{'confidence': '4', 'evaluation': '2', 'id': ... |
| 2 | 3 | accept | [{'confidence': '4', 'evaluation': '2', 'id': ... |
| 3 | 4 | accept | [{'confidence': '4', 'evaluation': '2', 'id': ... |
| 4 | 5 | accept | [{'confidence': '4', 'evaluation': '2', 'id': ... |


In [172]:
#Extracting the review-level features for the first paper only (index 0)
review_df = pd.json_normalize(paper_df["review"][0])

In [173]:
review_df.head()

|   | confidence | evaluation | id | lan | orientation | remarks | text | timespan |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | 1 | es | 0 |  | - El artículo aborda un problema contingente y... | 2010-07-05 |
| 1 | 4 | 1 | 2 | es | 1 |  | El artículo presenta recomendaciones prácticas... | 2010-07-05 |
| 2 | 5 | 1 | 3 | es | 1 |  | - El tema es muy interesante y puede ser de mu... | 2010-07-05 |


In [179]:
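#Combining the features that I am primarily interested in into final df
#Note: review_df holds only the first paper's reviews, so concat aligns on the row index and text/confidence are NaN beyond those rows
final_df = pd.concat([review_df.text, paper_df.preliminary_decision, review_df.confidence], axis=1)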
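
If one row per review across all papers is needed, with each review carrying its own paper's decision, `pd.json_normalize` can flatten the nested list directly through its `record_path` and `meta` parameters. The following is a minimal sketch under stated assumptions, not the method used above: the `papers` list is hypothetical toy data mirroring the nested layout, and `flat_df` is an illustrative name.

```python
import pandas as pd

# Hypothetical records mirroring the nested layout shown earlier.
papers = [
    {
        "id": 1,
        "preliminary_decision": "accept",
        "review": [
            {"id": 1, "confidence": "4", "text": "first review"},
            {"id": 2, "confidence": "5", "text": "second review"},
        ],
    },
    {
        "id": 2,
        "preliminary_decision": "reject",
        "review": [{"id": 1, "confidence": "3", "text": "third review"}],
    },
]

# One row per review; paper-level fields are repeated on each row.
# meta_prefix avoids a name clash between the paper id and the review id.
flat_df = pd.json_normalize(
    papers,
    record_path="review",
    meta=["id", "preliminary_decision"],
    meta_prefix="paper_",
)
print(flat_df[["paper_id", "text", "confidence", "paper_preliminary_decision"]])
```

In this notebook the input would presumably be `df["paper"].tolist()` rather than the toy list; label counts computed on such a frame are per review, not per paper, so they would differ from the per-paper counts shown below.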

In [188]:
final_df.preliminary_decision.value_counts()

accept             115
reject              48
probably reject      7
no decision          2
Name: preliminary_decision, dtype: int64