### Student Information
Name: 黃暐喬 Huang Wei Chiao

Student ID:  0756527 (NCTU)

GitHub ID:   weichiao-cs07

---

### Instructions

- First, you should attempt the **take home** exercises provided in the [notebook](https://github.com/omarsar/data_mining_lab/blob/master/news_data_mining.ipynb) we used for the first lab session. Attempt all the exercises, as this counts towards the final grade of your first assignment (20%).

- Then, download the dataset provided in this [link](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#). The sentiment dataset contains a `sentence` and `score` label. Read the specifications of the dataset before you start exploring it.


- Then, you are asked to apply each of the data exploration and data operation steps learned in the [first lab session](https://github.com/omarsar/data_mining_lab) on **the new dataset**. You don't need to explain all the procedures as we did in the notebook, but you are expected to provide some **minimal comments** explaining your code. You are also expected to use the same libraries used in the first lab session. You are allowed to use and modify the `helper` functions we provided in the first lab session or create your own. Also, be aware that the helper functions may need modification as you are dealing with a completely different dataset. This part is worth 30% of your grade!

- In addition to applying the same operations from the first lab, we are asking that you attempt the following tasks on the new sentiment dataset as well (40%):
    - Use your creativity and imagination to generate **new data visualizations**. Refer to online resources and the Data Mining textbook for inspiration and ideas. 
    - Generate **TF-IDF features** from the tokens of each text. Refer to this scikit-learn [guide](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) on how you may go about doing this. Keep in mind that you are generating a matrix similar to the term-document matrix we implemented in our first lab session. However, the weights will be computed differently and should represent the TF-IDF value of each word per document as opposed to the word frequency.
    - Using both the TF-IDF and word frequency features, try to compute the **similarity** between random sentences and report results. Read the "distance similarity" section of the Data Mining textbook on what measures you can use here. [Cosine similarity](https://jamesmccaffrey.wordpress.com/2017/03/29/the-cosine-similarity-of-two-sentences/) is one of these methods but there are others. Try to explore a few of them in this exercise and report the differences in the results.
    - Lastly, implement a simple **Naive Bayes classifier** that automatically classifies the records into their categories. Try to implement this using scikit-learn's built-in classifiers and use both the TF-IDF features and word frequency features to build two separate classifiers. Refer to this [nice article](https://hub.packtpub.com/implementing-3-naive-bayes-classifiers-in-scikit-learn/) on how to build this type of classifier using scikit-learn. Report the classification accuracy of both your models. If you are struggling with this step, please reach out to us on Slack as soon as possible.


- Presentation matters! You are also expected to **tidy up your notebook** and attempt new data operations and techniques that you have learned so far in the Data Mining course. Surprise us! This segment is worth 10% of your grade. The idea of this exercise is to begin thinking of how you will program the concepts you have learned and the process that is involved. 


- After completing all the above tasks, you are free to remove this header block and **submit** your assignment following the guide provided in the [README.md](https://github.com/omarsar/dm_2018_hw_1/blob/master/README.md) file of the assignment's repository. 

In [1]:
### Begin Assignment Here!
# import library
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import plotly.plotly as py
import plotly.graph_objs as go
import math
import os
import helpers.data_mining_helpers as dmh
%matplotlib inline


## 1. Loading Data and Converting into Pandas Dataframe  
__Source: the `sentiment_labelled_sentences` dataset from the UCI ML repository__

In [2]:
#dir = "/Users/ericahuang/Desktop/git/Github/dm_2018_hw_1"
# prepare dataset
here = os.path.dirname(__file__) if "__file__" in locals() else "."

files = [("amazon", os.path.join(here, "./data/sentiment_labelled_sentences/amazon_cells_labelled.txt")),
         ("imdb", os.path.join(here, "./data/sentiment_labelled_sentences/imdb_labelled.txt")),
         ("yelp", os.path.join(here, "./data/sentiment_labelled_sentences/yelp_labelled.txt"))]
dfs = []             

for provider, name in files:
    df = pd.read_csv(name, sep="\t")
    df.columns = ["sentence", "label"]
    df["provider"] = provider
    dfs.append(df)

senti_label = pd.concat(dfs, axis=0)
senti_label


Unnamed: 0,sentence,label,provider
0,"Good case, Excellent value.",1,amazon
1,Great for the jawbone.,1,amazon
2,Tied to charger for conversations lasting more...,0,amazon
3,The mic is great.,1,amazon
4,I have to jiggle the plug to get it to line up...,0,amazon
5,If you have several dozen or several hundred c...,0,amazon
6,If you are Razr owner...you must have this!,1,amazon
7,"Needless to say, I wasted my money.",0,amazon
8,What a waste of money and time!.,0,amazon
9,And the sound quality is great.,1,amazon
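By default, `pd.read_csv` uses the first line of each file as the header, so that line does not end up in the data. The sketch below assumes the raw files have no header row (an assumption about the dataset files, not something shown above) and uses made-up in-memory text instead of the real files. `header=None` together with `names=` keeps the first sentence as a record, and `quoting=csv.QUOTE_NONE` treats a stray double quote inside a sentence as an ordinary character.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for one tab-separated file without a header row.
raw_text = 'A "quoted start with no closing quote\t1\nSecond sentence.\t0\n'

demo_df = pd.read_csv(
    io.StringIO(raw_text),
    sep="\t",
    header=None,                    # first line is data, not column names
    names=["sentence", "label"],    # assign the column names directly
    quoting=csv.QUOTE_NONE,         # double quotes are ordinary characters
)
print(demo_df.shape)
```

If this assumption holds for the real files, the same three keyword arguments could be passed to the `pd.read_csv` call in the loading loop.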


In [3]:
# size of the dataframe
print(senti_label.shape)
# type of the object
print(type(senti_label))
# last five records
print(senti_label[-5:])

(2745, 3)
<class 'pandas.core.frame.DataFrame'>
                                              sentence  label provider
994  I think food should have flavor and texture an...      0     yelp
995                           Appetite instantly gone.      0     yelp
996  Overall I was not impressed and would not go b...      0     yelp
997  The whole experience was underwhelming, and I ...      0     yelp
998  Then, as if I hadn't wasted enough of my life ...      0     yelp


## 2. Query 
__Using the built-in `iloc` and `loc` indexers__


In [4]:
senti_label.provider[:2]

0    amazon
1    amazon
Name: provider, dtype: object

In [5]:
senti_label.sentence[0:2]

0    Good case, Excellent value.
1         Great for the jawbone.
Name: sentence, dtype: object

In [6]:
senti_label[0:100:10]

Unnamed: 0,sentence,label,provider
0,"Good case, Excellent value.",1,amazon
10,He was very impressed when going from the orig...,1,amazon
20,I bought this to use with my Kindle Fire and a...,1,amazon
30,This product is ideal for people like me whose...,1,amazon
40,I was not impressed by this product.,0,amazon
50,good protection and does not make phone too bu...,1,amazon
60,I really recommend this faceplates since it lo...,1,amazon
70,"Even in my BMW 3 series which is fairly quiet,...",0,amazon
80,Not a good bargain.,0,amazon
90,Made very sturdy.,1,amazon


In [7]:
# using iloc
# from record 10 to the end, take every 5th record, keep only columns 0 and 1,
# and show at most 15 records in total
senti_label.iloc[10::5,:2][0:15]

Unnamed: 0,sentence,label
10,He was very impressed when going from the orig...,1
15,I advise EVERYONE DO NOT BE FOOLED!,0
20,I bought this to use with my Kindle Fire and a...,1
25,I've owned this phone for 7 months now and can...,1
30,This product is ideal for people like me whose...,1
35,It has kept up very well.,1
40,I was not impressed by this product.,0
45,Who in their right mind is gonna buy this batt...,0
50,good protection and does not make phone too bu...,1
55,VERY DISAPPOINTED.,0


In [8]:
# using loc
senti_label.loc[::5, 'sentence'][0:15]

0                           Good case, Excellent value.
5     If you have several dozen or several hundred c...
10    He was very impressed when going from the orig...
15                  I advise EVERYONE DO NOT BE FOOLED!
20    I bought this to use with my Kindle Fire and a...
25    I've owned this phone for 7 months now and can...
30    This product is ideal for people like me whose...
35                            It has kept up very well.
40                 I was not impressed by this product.
45    Who in their right mind is gonna buy this batt...
50    good protection and does not make phone too bu...
55                                   VERY DISAPPOINTED.
60    I really recommend this faceplates since it lo...
65    A week later after I activated it, it suddenly...
70    Even in my BMW 3 series which is fairly quiet,...
Name: sentence, dtype: object

In [9]:
senti_label[senti_label['label']==0].iloc[::10][0:5]

Unnamed: 0,sentence,label,provider
2,Tied to charger for conversations lasting more...,0,amazon
21,The commercials are the most misleading.,0,amazon
38,worthless product.,0,amazon
62,Buy a different phone - but not this.,0,amazon
83,"This item worked great, but it broke after 6 m...",0,amazon
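The queries above mix position-based (`iloc`) and label-based (`loc`) indexing. As a minimal sketch on a made-up frame (the data and index labels here are hypothetical), the two differ in how they treat the end of a slice: `iloc` excludes the end position, while `loc` includes the end label.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
demo = pd.DataFrame(
    {"sentence": ["a", "b", "c", "d"], "label": [1, 0, 1, 0]},
    index=[10, 20, 30, 40],
)

by_position = demo.iloc[0:2]   # positions 0 and 1 (end position excluded)
by_label = demo.loc[10:30]     # labels 10, 20 and 30 (end label included)

print(by_position.index.tolist())
print(by_label.index.tolist())
```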


## 3. Data Mining using Pandas
### 3.1 Dealing with Missing Values

In [10]:
senti_label.isnull()
senti_label.isnull().apply(lambda x: dmh.check_missing_values(x))

sentence    (The amoung of missing records is: , 0)
label       (The amoung of missing records is: , 0)
provider    (The amoung of missing records is: , 0)
dtype: object

In [11]:
senti_label.isnull().apply(lambda x: dmh.check_missing_values(x), axis=1)

0      (The amoung of missing records is: , 0)
1      (The amoung of missing records is: , 0)
2      (The amoung of missing records is: , 0)
3      (The amoung of missing records is: , 0)
4      (The amoung of missing records is: , 0)
5      (The amoung of missing records is: , 0)
6      (The amoung of missing records is: , 0)
7      (The amoung of missing records is: , 0)
8      (The amoung of missing records is: , 0)
9      (The amoung of missing records is: , 0)
10     (The amoung of missing records is: , 0)
11     (The amoung of missing records is: , 0)
12     (The amoung of missing records is: , 0)
13     (The amoung of missing records is: , 0)
14     (The amoung of missing records is: , 0)
15     (The amoung of missing records is: , 0)
16     (The amoung of missing records is: , 0)
17     (The amoung of missing records is: , 0)
18     (The amoung of missing records is: , 0)
19     (The amoung of missing records is: , 0)
20     (The amoung of missing records is: , 0)
21     (The a

__Creating two records with a missing value in both ways (Series and dict)__

In [13]:
dummy_series = pd.Series(["dummy_record_series", 1], index=["sentence", "label"])
dummy_dict = [{'sentence': 'dummy_record_dic',
               'label': 1
              }]
print ("dummy_series\n",dummy_series)
print ("\ndummy_dict\n",dummy_dict)
senti_label = senti_label.append(dummy_series, ignore_index=True)
senti_label = senti_label.append(dummy_dict, ignore_index=True)
print("\n",senti_label[-5:])
print("\nlength:",len(senti_label))
senti_label.isnull()[-5:]

dummy_series
 sentence    dummy_record_series
label                         1
dtype: object

dummy_dict
 [{'sentence': 'dummy_record_dic', 'label': 1}]

                                                sentence  label provider
2744  Then, as if I hadn't wasted enough of my life ...      0     yelp
2745                                dummy_record_series      1      NaN
2746                                   dummy_record_dic      1      NaN
2747                                dummy_record_series      1      NaN
2748                                   dummy_record_dic      1      NaN

length: 2749


Unnamed: 0,sentence,label,provider
2744,False,False,False
2745,False,False,True
2746,False,False,True
2747,False,False,True
2748,False,False,True
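`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older versions. A minimal sketch of the same idea with `pd.concat`, using a small made-up frame (`demo` is a hypothetical stand-in for `senti_label`):

```python
import pandas as pd

# Hypothetical one-row stand-in for the real dataframe.
demo = pd.DataFrame({"sentence": ["ok"], "label": [1], "provider": ["amazon"]})

dummy_series = pd.Series(["dummy_record_series", 1], index=["sentence", "label"])
dummy_dict = [{"sentence": "dummy_record_dic", "label": 1}]

# A Series becomes a one-row frame via to_frame().T; the list of dicts
# goes through the DataFrame constructor. Neither has a "provider" value,
# so concat fills that column with NaN for both new rows.
demo = pd.concat(
    [demo, dummy_series.to_frame().T, pd.DataFrame(dummy_dict)],
    ignore_index=True,
)
print(demo)
```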


In [15]:
senti_label.isnull().apply(lambda x: dmh.check_missing_values(x))

sentence    (The amoung of missing records is: , 0)
label       (The amoung of missing records is: , 0)
provider    (The amoung of missing records is: , 4)
dtype: object
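As an alternative that does not rely on the helper function, missing values can also be counted with plain pandas. A minimal sketch on made-up data (the frame below is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing provider.
demo = pd.DataFrame({
    "sentence": ["good", "bad", "dummy"],
    "label": [1, 0, 1],
    "provider": ["amazon", "yelp", np.nan],
})

missing_per_column = demo.isnull().sum()        # one count per column
missing_per_row = demo.isnull().sum(axis=1)     # one count per record

print(missing_per_column)
print(missing_per_row)
```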

__Drop all records with missing values__

In [20]:
senti_label.dropna(inplace=True)
senti_label.isnull().apply(lambda x: dmh.check_missing_values(x))
print("length:",len(senti_label))

length: 2745


### 3.2 Dealing with Duplicate Data

In [29]:
senti_label.duplicated()
senti_label.isnull().apply(lambda x: dmh.check_missing_values(x), axis=1)[-5:]

2708    (The amoung of missing records is: , 0)
2709    (The amoung of missing records is: , 0)
2710    (The amoung of missing records is: , 0)
2711    (The amoung of missing records is: , 0)
2712    (The amoung of missing records is: , 0)
dtype: object

In [30]:
dummy_duplicate_dict = [{
                             'sentence': 'dummy record',
                             'label': 1, 
                             'provider': "dummy category"
                        },
                        {
                             'sentence': 'dummy record',
                             'label': 1, 
                             'provider': "dummy category"
                        }]

In [31]:
senti_label = senti_label.append(dummy_duplicate_dict, ignore_index=True)
print ("length: ",len(senti_label))
print ("sum:",sum(senti_label.duplicated('sentence')))

length:  2715
sum: 3


__Drop all records that have duplicates__

In [32]:
# keep=False drops every copy of a duplicated row, not just the repeats
senti_label.drop_duplicates(keep=False, inplace=True)
print ("length: ",len(senti_label))
print ("sum:",sum(senti_label.duplicated('sentence')))

length:  2711
sum: 0
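The choice of `keep` decides how many rows survive. A minimal sketch on made-up data (the frame is hypothetical) comparing `keep="first"`, which retains one copy of each duplicated row, with `keep=False`, which removes all copies:

```python
import pandas as pd

# Hypothetical frame in which the first two rows are identical.
demo = pd.DataFrame({
    "sentence": ["dummy record", "dummy record", "unique record"],
    "label": [1, 1, 0],
})

n_repeats = int(demo.duplicated("sentence").sum())   # rows repeating an earlier sentence
kept_first = demo.drop_duplicates(keep="first")      # one copy of each row remains
kept_none = demo.drop_duplicates(keep=False)         # all copies of duplicated rows removed

print(n_repeats, len(kept_first), len(kept_none))
```

Using `keep="first"` would be the usual choice when the duplicated sentences are genuine records that should appear once rather than injected dummies.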


## 4.  Data Preprocessing
### 4.1 Sampling

In [36]:
senti_label_sample = senti_label.sample(n=1000)
print ("type", type(senti_label_sample))
print ("length: ",len(senti_label_sample))

type <class 'pandas.core.frame.DataFrame'>
length:  1000
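`sample` draws different rows on every run unless a seed is fixed. A minimal sketch on made-up data, assuming reproducible samples are wanted (the seed value 42 is arbitrary):

```python
import pandas as pd

# Hypothetical frame with ten records.
demo = pd.DataFrame({
    "sentence": [f"sentence {i}" for i in range(10)],
    "label": [i % 2 for i in range(10)],
})

# The same random_state gives the same rows on every call.
first_draw = demo.sample(n=3, random_state=42)
second_draw = demo.sample(n=3, random_state=42)

print(first_draw.equals(second_draw))
```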
