# Analysis of Smartphone Reviews

github: https://github.com/sundawei2018/web-analytics-final-project.git

## 1. Using web crawler to scrape online reviews 
- The Best Buy website does not allow web crawler to get users' data. Therefore, we have to use Selenium web driver which can automate any web browsers in order to access the information we want to obtain
- For each smartphone, we need to put correct URL into the page_url_arr and specify the name of the file we want to save reviews into.
- In order to run this block, **you have to specify a correct path of a chromedriver**.
- It takes time for the automated web brower to scrape data
- **We suggest not to run this block**

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
import csv

def getReviews(page_url):
    reviews = []
    i = 1
    # chrome driver path 
    chrome_path = r"C:\Users\dsun2\Documents\BIA 660\project\chromedriver.exe"
    driver = webdriver.Chrome(chrome_path)
    
    while page_url != None:
        driver.get(page_url + str(i))
        i = i + 1
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        divs = soup.select("div.container-fluid div.reviews-content-wrapper ul li.review-item")
        if len(divs) == 0:
            page_url = None
        else:  
            for idx, div in enumerate(divs):
                rating = None
                comment = None
                review = []
                rating_tmp = div.select("div div div div div div span.reviewer-score")
                # get rating
                if rating_tmp != []:
                    rating = rating_tmp[0].get_text().encode('utf-8')
                comment_tmp = div.select("div div div div p.pre-white-space")
                # get review                 
                if comment_tmp != []:
                    comment = comment_tmp[0].get_text().encode('utf-8')
                
                review = [rating, comment]
                reviews.append(review)
    return reviews

def save(reviews):
    file_name = "raw_Note8_review.csv"
    with open(file_name, "wb") as f:
        writer = csv.writer(f, dialect = 'excel')
        writer.writerows(reviews)
    f.close()

if __name__ == "__main__":
    
    # URLs which we want to collect from
    
#    iphone reviews
#    "https://www.bestbuy.com/site/reviews/apple-iphone-8-64gb-gold-verizon/6009932?sort=MOST_HELPFUL&page=",
#    "https://www.bestbuy.com/site/reviews/apple-iphone-8-256gb-space-gray-at-t/6009695?sort=MOST_HELPFUL&page=",
#    "https://www.bestbuy.com/site/reviews/apple-iphone-8-64gb-gold-sprint/6009816?sort=MOST_HELPFUL&page="
    
#    S8 reviews
#    https://www.bestbuy.com/site/reviews/samsung-galaxy-s8-64gb-midnight-black-verizon/5789008?sort=MOST_HELPFUL&page=
    
#    Note 8 reivews
#    https://www.bestbuy.com/site/reviews/samsung-galaxy-s8-64gb-midnight-black-at-t/5770905?sort=MOST_HELPFUL&page=4
    page_url_arr = ["https://www.bestbuy.com/site/reviews/samsung-galaxy-s8-64gb-midnight-black-at-t/5770905?sort=MOST_HELPFUL&page="]
    
    # For each one of them, get reviews and save them to a local file
    for page_url in page_url_arr:
        reviews = getReviews(page_url)
        save(reviews)

## 2. Cleaning reviews
- Tokenize each review into unigram 
- Remove stop words such as a, an, the, and, or
- Remove punctuations, digits, and words start with '
- Save cleaned data into a file
- **Cleaned reviews contain empty lines between two adjacent reviews after running this block which will double the sample size**
- **We need to go to the csv file and select empty lines in order to remove them**
- **We suggest not to run this block**

In [2]:
import csv
import nltk, string
stop_words = ['a', 'an', 'the', 'and', 'or']

def get_review_tokens(review):
    # unigram tokenization pattern
    pattern = r'\w+[\-]*\w+'                          
    # get unigrams
    tokens=[token.strip() \
            for token in nltk.regexp_tokenize(review.lower(), pattern) \
            if token.strip() not in string.punctuation and \
            token.strip() not in stop_words and \
            token.strip() if not token.isdigit() and \
            token.strip() if not token.startswith('\'')
            ]    
    return tokens

def clean_review(tokens):
    review = " ".join(get_review_tokens(tokens))
    return review   

def save(reviews):
    with open("clean_iphone8_review.csv", "wb") as f:
        writer = csv.writer(f, dialect = 'excel')
        writer.writerows(reviews)
    f.close()

if __name__ == "__main__":
    f = open("raw_iphone8_review.csv", "rb")
    reader = csv.reader(f)
    reviews = []
    for utf8_row in reader:
        unicode_row = [row.decode('utf8') for row in utf8_row]
        review = [unicode_row[0], clean_review(unicode_row[1])]
        reviews.append(review)
    f.close()
    save(reviews)

## 3. Creating Labels for Reviews
- Open three files and create three arrays to store reviews respectively
- Labels which our group used are battery, camera, processor, screen, and others
- Label each review and create a new file called training_data.csv

In [4]:
import csv

if __name__ == "__main__":
    with open("clean_iPhone8_review.csv", "rb") as f:
        reader = csv.reader(f, dialect = 'excel')
        iPhone_reviews = [row for row in reader]

    with open("clean_Note8_review.csv", "rb") as f:
        reader = csv.reader(f, dialect = 'excel')
        Note8_reviews = [row for row in reader]
    
    with open("clean_S8_review.csv", "rb") as f:
        reader = csv.reader(f, dialect = 'excel')
        S8_reviews = [row for row in reader]
    
    
    # iphone reviews
    battery_iPhone_idx = [8, 9, 20, 21, 40, 42, 96, 102, 131, 139, 141, 146, \
                         147, 148, 157, 162, 180, 183, 186, 197, 209, 210, 230, \
                         234, 243, 248, 274, 280, 294, 297, 299, 303, 315,325, 343, \
                         356, 399, 404, 407, 416, 417, 426, 431, 445, 446, 495, 522, 530, 539, 540, 550, \
                         553,562, 583, 590, 606] 
    
    camera_iPhone_idx = [2, 10, 11, 12, 13, 30, 34, 39, 47, 49, 58, 61, 67, 85, 90, \
                         91, 105, 109, 116, 119, 120, 129, 132, 134, 137, 138, 144, 145, \
                         152, 154, 155, 164, 165, 168, 175, 177, 181, 187, 189, 194, 195, 202, \
                         206, 221, 227, 228, 233, 235, 238, 240, 246, 247, 251, 259, 260, 264, \
                         267, 277, 290, 291, 307, 312, 314, 321, 324, 329, 331, 332, 334, 335, \
                         338, 340, 347, 349, 350, 351, 358, 359, 360, 364, 368, 374, 376, 379, \
                         383, 394, 402, 409, 421, 425, 437, 443, 457, 463, 467, 468, 470, 473, \
                         476, 484, 485, 488, 491, 500]
    
    screen_iPhone_idx = [14, 15, 26, 53, 66, 71, 73, 79, 84, 126, 254, 271, 272, 298, 328, 354, \
                         377, 391, 429, 441, 466, 482, 533, 611]
    
    processor_iPhone_idx = [86, 119, 134, 141, 154, 165, 182, 185, 190, 191, 192, 215, 261, 297, 300, \
                            302, 308, 410, 528, 602, 603]
    
    others_idx = [0, 1, 3, 4, 5, 6, 7, 16, 17, 18, 19, 22, 23, 24, 25, 27, 28, 29, 31, 32, 33, 35, \
                  36, 37, 38, 41, 43, 44, 45, 46, 48, 50, 51, 52, 54, 55, 56, 57, 59, 60, 62, 63, 64, \
                  65, 68, 69, 70, 72, 74, 75]
    
#    Note 8 reviews
    battery_note8_idx = [0, 1, 6, 9, 19, 21, 22, 26, 27, 31, 61, 65, 67, 73, 87, 94, 98, 110, 112, \
                         118, 132, 133, 138, 141, 146, 151, 173, 177, 201, 206, 208, 209, 225, 230, 239, \
                         254, 269, 273, 277, 287, 291, 296, 299, 305, 308, 314, 317, 323, 331, 332, 333, 334, \
                         348, 374, 396, 402, 403, 410, 415, 428, 430, 431, 437, 444, 448, 449, 450, 457, 460, \
                         466, 482, 504, 505, 514, 515, 516, 519, 520, 524, 556, 566, 574, 580, 602, 606, 612, \
                         629, 632, 641, 643, 648, 654, 674, 679, 683, 696, 700, 701, 716, 730, 766]
    
    camera_note8_idx = [4, 8, 12, 17, 28, 29, 32, 33, 34, 37, 45, 46, 50, 51, 54, 58, 66, 70, 75, \
                        80, 81, 83, 85, 91, 102, 107, 111, 113, 116, 120, 128, 129, 135, 136, 147, 148, \
                        156, 161, 169, 178, 182, 183, 188, 190, 193, 200, 204, 205, 237, 247, 250, 253, 257, \
                        261, 263, 265, 270, 274, 279, 283, 288, 292, 295, 298, 301, 315, 316, 318, 326, 337, \
                        349, 350, 352, 353, 356, 360, 364, 381, 383, 389, 393, 414, 422, 440, 441, 454, 458, \
                        459, 462, 463, 465, 468, 469, 478, 479, 484, 487, 489, 493, 498]
    
    screen_note8_idx = [5, 7, 10, 20, 39, 42, 56, 60, 63, 71, 127, 137, 163, 176, 180, 184, 191, 192, \
                        199, 202, 203, 211, 213, 218, 235, 236, 240, 255, 258, 271, 281, 328, 330, 363, \
                        378, 387, 388, 390, 392, 399, 401, 408, 412, 413, 421, 424, 433, 436, 438, 439, \
                        464, 471, 474, 475, 491, 495, 497, 512, 526, 545, 546, 547, 552, 562, 563, 570, \
                        591, 595, 601, 603, 605, 609, 611, 613, 626, 630, 631, 635, 658, 661, 666, 668, \
                        680, 695, 705, 708, 709, 711, 717, 722, 723, 729, 732, 734, 744, 751, 752, 762, \
                        763, 765, 796, 797]
    
    processor_note8_idx = [1, 2, 71, 212, 232, 271, 369, 385, 425, 535, 637, 667, 750, 754, 774, 794, \
                           800, 805, 813, 832, 837, 878, 886, 902]
    
    others_note8_idx = [3, 11, 13, 14, 15, 16, 18, 23, 24, 25, 30, 35, 36, 38, 40, 41, 43, 44, 47, 48, \
                        49, 52, 53, 55, 57, 59, 62, 64, 68, 69, 72, 74, 76, 77, 78, 79, 82, 84, 86, 88, 89, \
                        90, 92, 93, 95, 96, 97, 99, 100, 101]

#   S8 reviews
    battery_s8_idx = [0, 2, 10, 12, 14, 16, 20, 25, 32, 46, 52, 65, 66, 69, 77, 78, 88, 90, 91, 92, 98, \
                      116, 117, 127, 139, 151, 160, 167, 173, 175, 176, 183, 189, 195, 201, 208, 210, 212, \
                      214, 216, 220, 224, 228, 234, 242, 247, 249, 260, 265, 274, 283, 289, 296, 297, 300, 304, \
                      316, 326, 328, 343, 344, 355, 361, 364, 367, 379, 395, 408, 410, 414, 427, 435, 437, 443, \
                      447, 450, 453, 457, 459, 465, 469, 471, 493, 510, 511, 512, 536, \
                      539, 544, 554, 555, 559, 568, 574, 577, 584, 585, 586, 589, 600]
    
    camera_s8_idx = [3, 5, 8, 11, 15, 18, 23, 27, 31, 36, 38, 41, 42, 44, 45, 48, 51, 55, 79, 85, 86, 96, \
                     101, 103, 105, 118, 119, 122, 128, 131, 137, 138, 141, 144, 153, 157, 161, 164, 169, \
                     171, 188, 193, 196, 198, 200, 202, 206, 207, 211, 217, 219, 223, 226, 241, 250, 261, 262, \
                     263, 264, 271, 277, 279, 285, 286, 294, 295, 311, 313, 317, 319, 322, 329, 334, 341, 347, 351, \
                     359, 366, 374, 375, 376, 382, 385, 391, 396, 399, 401, 405, 407, 409, 412, 416, 417, 419, 428, \
                     431, 432, 433, 436, 444]
    
    screen_s8_idx = [6, 7, 9, 13, 17, 21, 22, 26, 39, 47, 49, 50, 53, 56, 61, 62, 63, 67, 75, 87, 89, 93, \
                     110, 133, 136, 146, 184, 187, 204, 227, 230, 248, 257,258, 259, 269, 270, 275, 276, 284, \
                     290, 292, 293, 301, 306, 309, 331, 339, 346, 365, 371, 381, 393, 403, 411, 415, 423, 440, \
                     454, 463, 464, 480, 485, 504, 508, 516, 518, 524, 529, 530, 540, 552, 557, 565, 566, 569, \
                     571, 572, 579, 581, 587, 594, 595, 597, 606, 616, 620, 634, 635, 639, 648, 652, 670, 671, \
                     682, 691, 694, 733, 734]
    
    processor_s8_idx = [37, 140, 143, 152, 235, 535, 638, 713, 801, 844, 896, 913, 950, 991, 1017, 1046, \
                        1057, 1242, 1332, 1356, 1446, 1509, 1644, 1648, 1824]
    
    others_s8_idx = [1, 4, 19, 24, 28, 29, 30, 33, 34, 35, 40, 43, 54, 57, 58, 59, 60, 64, 68, 70, 71, 72, \
                  73, 74, 76, 80, 81, 82, 83, 84, 94, 95, 97,99, 100, 102, 104, 106, 107, 108, 109, 111, 112, \
                  113, 114, 115, 120, 121, 123, 124]

    with open("training_data.csv", "wb") as f:
        writer = csv.writer(f, dialect = 'excel')
        
        for idx in camera_iPhone_idx:
            writer.writerow(('camera', iPhone_reviews[idx][1]))
        for idx in battery_iPhone_idx:
            writer.writerow(('battery', iPhone_reviews[idx][1]))
        for idx in screen_iPhone_idx:
            writer.writerow(('screen', iPhone_reviews[idx][1]))
        for idx in processor_iPhone_idx:
            writer.writerow(('processor', iPhone_reviews[idx][1]))
        for idx in others_idx:
            writer.writerow(('others', iPhone_reviews[idx][1]))
            
        for idx in battery_note8_idx:
            writer.writerow(('battery', Note8_reviews[idx][1]))
        for idx in camera_note8_idx:
            writer.writerow(('camera', Note8_reviews[idx][1]))
        for idx in screen_note8_idx:
            writer.writerow(('screen', Note8_reviews[idx][1]))
        for idx in processor_note8_idx:
            writer.writerow(('processor', Note8_reviews[idx][1]))
        for idx in others_note8_idx:
            writer.writerow(('others', Note8_reviews[idx][1])) 
            
        for idx in battery_s8_idx:
            writer.writerow(('battery', S8_reviews[idx][1]))
        for idx in camera_s8_idx:
            writer.writerow(('camera', S8_reviews[idx][1]))
        for idx in screen_s8_idx:
            writer.writerow(('screen', S8_reviews[idx][1]))
        for idx in processor_s8_idx:
            writer.writerow(('processor', S8_reviews[idx][1]))
        for idx in others_s8_idx:
            writer.writerow(('others', S8_reviews[idx][1]))

## 4. Training Data and Using CNN for multi-label classification
- Load training_data.csv into file and create two lists. One list contains labels and the other one contains reviews
- Split training data into 30% for training and 70% for testing
- Use "Early Stopping" to stop training the model when testing accuracy/loss gets worse

In [1]:
import csv
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import MultiLabelBinarizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
from sklearn.model_selection import train_test_split
from keras.layers import Embedding, Dense, Conv1D, MaxPooling1D, Dropout, Activation, Input, Flatten, Concatenate
from keras.models import Model
from keras.regularizers import l2
from keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.metrics import classification_report

def cnn_model(FILTER_SIZES, \
              # filter sizes as a list
              MAX_NB_WORDS, \
              # total number of words
              MAX_DOC_LEN, \
              # max words in a doc
              NUM_OUTPUT_UNITS=1, \
              # number of output units
              EMBEDDING_DIM=200, \
              # word vector dimension
              NUM_FILTERS=64, \
              # number of filters for all size
              DROP_OUT=0.5, \
              # dropout rate
              PRETRAINED_WORD_VECTOR=None,\
              # Whether to use pretrained word vectors
              LAM=0.01):            
              # regularization coefficient
    
    main_input = Input(shape=(MAX_DOC_LEN,), dtype='int32', name='main_input')
    
    if PRETRAINED_WORD_VECTOR is not None:
        embed_1 = Embedding(input_dim=MAX_NB_WORDS+1, \
                        output_dim=EMBEDDING_DIM, \
                        input_length=MAX_DOC_LEN, \
                        weights=[PRETRAINED_WORD_VECTOR],\
                        trainable=False,\
                        name='embedding')(main_input)
    else:
        embed_1 = Embedding(input_dim=MAX_NB_WORDS+1, \
                        output_dim=EMBEDDING_DIM, \
                        input_length=MAX_DOC_LEN, \
                        name='embedding')(main_input)
    # add convolution-pooling-flat block
    conv_blocks = []
    for f in FILTER_SIZES:
        conv = Conv1D(filters=NUM_FILTERS, kernel_size=f, \
                      activation='relu', name='conv_'+str(f))(embed_1)
        conv = MaxPooling1D(MAX_DOC_LEN-f+1, name='max_'+str(f))(conv)
        conv = Flatten(name='flat_'+str(f))(conv)
        conv_blocks.append(conv)

    z=Concatenate(name='concate')(conv_blocks)
    drop=Dropout(rate=DROP_OUT, name='dropout')(z)

    dense = Dense(192, activation='relu',\
                    kernel_regularizer=l2(LAM),name='dense')(drop)
    preds = Dense(NUM_OUTPUT_UNITS, activation='sigmoid', name='output')(dense)
    model = Model(inputs=main_input, outputs=preds)
    
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]) 
    return model

if __name__ == "__main__":
    labels = []
    reviews = []
    with open('training_data.csv', 'r') as f:
        reader = csv.reader(f, dialect = 'excel')
        for row in reader:
            labels.append([row[0]])
            reviews.append(row[1])
    f.close()
    
    BEST_MODEL_FILEPATH = 'cnn_model'
    
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)
    # get a Keras tokenizer
    MAX_NB_WORDS=8000
    # documents are quite long in the dataset
    MAX_DOC_LEN=100
    
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(reviews)
    # convert each document to a list of word index as a sequence
    sequences = tokenizer.texts_to_sequences(reviews)
    # get the mapping between words to word index
    
    # pad all sequences into the same length (the longest)
    padded_sequences = pad_sequences(sequences, \
                                     maxlen=MAX_DOC_LEN, \
                                     padding='post', truncating='post')
    NUM_OUTPUT_UNITS=len(mlb.classes_)

    
    EMBEDDING_DIM=100
    FILTER_SIZES=[2,3,4]
    
    BTACH_SIZE = 64
    NUM_EPOCHES = 20
    
    # split dataset into train (70%) and test sets (30%)
    X_train, X_test, Y_train, Y_test = train_test_split(padded_sequences, Y, test_size=0.3, random_state=0)
    
    
    model=cnn_model(FILTER_SIZES, MAX_NB_WORDS, MAX_DOC_LEN, NUM_OUTPUT_UNITS)
    
    earlyStopping=EarlyStopping(monitor='val_loss', patience=0, verbose=2, mode='min')
    checkpoint = ModelCheckpoint(BEST_MODEL_FILEPATH, monitor='val_acc', \
                                 verbose=2, save_best_only=True, mode='max')
        
    training = model.fit(X_train, Y_train, \
             batch_size=BTACH_SIZE, epochs=NUM_EPOCHES, \
             callbacks=[earlyStopping, checkpoint],\
             validation_data=[X_test, Y_test], verbose=2)

Using TensorFlow backend.


Train on 704 samples, validate on 302 samples
Epoch 1/20
Epoch 00000: val_acc improved from -inf to 0.80000, saving model to cnn_model
5s - loss: 2.3400 - acc: 0.7477 - val_loss: 2.0297 - val_acc: 0.8000
Epoch 2/20
Epoch 00001: val_acc did not improve
4s - loss: 1.8205 - acc: 0.8000 - val_loss: 1.6020 - val_acc: 0.8000
Epoch 3/20
Epoch 00002: val_acc did not improve
4s - loss: 1.4590 - acc: 0.8000 - val_loss: 1.2910 - val_acc: 0.8000
Epoch 4/20
Epoch 00003: val_acc improved from 0.80000 to 0.80199, saving model to cnn_model
4s - loss: 1.1741 - acc: 0.8017 - val_loss: 1.0252 - val_acc: 0.8020
Epoch 5/20
Epoch 00004: val_acc improved from 0.80199 to 0.89139, saving model to cnn_model
4s - loss: 0.9241 - acc: 0.8366 - val_loss: 0.7722 - val_acc: 0.8914
Epoch 6/20
Epoch 00005: val_acc improved from 0.89139 to 0.92715, saving model to cnn_model
4s - loss: 0.6963 - acc: 0.8912 - val_loss: 0.5744 - val_acc: 0.9272
Epoch 7/20
Epoch 00006: val_acc improved from 0.92715 to 0.93576, saving model 

## 5. Generate Classification Report
- Use X_test as input to CNN model to make prediction to obtain Y_pred
- Compare Y_pred and Y_test

In [5]:
model.load_weights("cnn_model")
pred = model.predict(X_test)
# create a copy of the predicated probabilities
Y_pred=np.copy(pred)
# if prob>0.5, set it to 1 else 0
Y_pred=np.where(Y_pred>0.5,1,0)

print(classification_report(Y_test, Y_pred, target_names=mlb.classes_))

             precision    recall  f1-score   support

    battery       0.97      0.96      0.97        75
     camera       0.99      0.99      0.99       100
     others       1.00      0.90      0.95        41
  processor       0.90      0.95      0.92        19
     screen       0.98      0.97      0.98        67

avg / total       0.98      0.96      0.97       302



## 6. Use CNN model to Classify Reviews
- After obtaining a relatively good model, it can be used to make predictions
- Use customers' reviews as input to the model, output will be corresponding feature predicted by using the model
- The output of the CNN is a list containing five items which correspond to [battery, camera, others, processor, screen]
- Store reviews in different arrays

In [54]:
    def classify_reviews(file_name):
        with open(file_name, "r") as f:
            reader = csv.reader(f, dialect = 'excel')
            reviews = [line[1] for line in reader]
        f.close()

        tokenizer.fit_on_texts(reviews)
        # convert each document to a list of word index as a sequence
        sequences = tokenizer.texts_to_sequences(reviews)
        # get the mapping between words to word index

        # pad all sequences into the same length (the longest)
        padded_sequences = pad_sequences(sequences, \
                                         maxlen=MAX_DOC_LEN, \
                                         padding='post', truncating='post')

        print ("the number of Note 8 reviews : ", len(padded_sequences))

        # use CNN model to make predictions
        preds = model.predict(padded_sequences)
        labels = np.copy(preds)
        # set probability greater than 0.5 to 1
        labels = np.where(labels > 0.5, 1, 0)

        #[battery, camera, others, processor, screen]
        battery_arr = []
        camera_arr = []
        others_arr = []
        processor_arr = []
        screen_arr = []

        for i in range(len(preds)):
            if labels[i][0] == 1:
                battery_arr.append(reviews[i])
            elif labels[i][1] == 1:
                camera_arr.append(reviews[i])
            elif labels[i][3] == 1:
                processor_arr.append(reviews[i])
            elif labels[i][4] == 1:
                screen_arr.append(reviews[i])
            else:
                others_arr.append(reviews[i])

        print ("battery: " , len(battery_arr))
        print ("camera: ", len(camera_arr))
        print ("processor: ", len(processor_arr))
        print ("screen: ", len(screen_arr))
        print ("others ", len(others_arr))
        
        return battery_arr, camera_arr, others_arr, processor_arr, screen_arr
        
    if __name__ == "__main__":
        print ("Summary of Note8 reviews: ")
        Note8_battery, Note8_camera, Note8_others, Note8_processor, Note8_screen = classify_reviews("clean_Note8_review.csv")
        
        print ("Summary of S8 reviews: ")
        S8_battery, S8_camera, S8_others, S8_processor, S8_screen = classify_reviews("clean_S8_review.csv")
        
        print ("Summary of iPhone8 reviews: ")
        iPhone8_battery, iPhone8_camera, iPhone8_others, iPhone8_processor, iPhone8_screen = classify_reviews("clean_iPhone8_review.csv")

Summary of Note8 reviews: 
the number of Note 8 reviews :  921
battery:  178
camera:  132
processor:  51
screen:  124
others  436
Summary of S8 reviews: 
the number of Note 8 reviews :  1881
battery:  354
camera:  296
processor:  75
screen:  301
others  855
Summary of iPhone8 reviews: 
the number of Note 8 reviews :  633
battery:  97
camera:  92
processor:  20
screen:  86
others  338


## Print out some reviews
- Print some reviews to see how our model does with customers' reviews
- We realize that even though the precision and recall of our model is high, it cannot always make correct prediction with customers' reviews we scraped online

In [32]:
for idx, review in enumerate(S8_battery[0:10]):
    print (idx, review)

0 switched from lg g5 phone is very fast so far my battery is lasting longer than day with lot of use left screen on which still looks great as someone else pointed out screen protectors are pain not just zagg one none of them seem to fit slight curve phone is fast looks amazing
1 so was really excited to get this phone it looks great it fast call clarity is superb screen size is amazing camera is top notch but screen has already cracked phone has not been dropped yet like to think take good care of it noticed small spec in screen one afternoon it got worse over next few days now it runs up down entire screen honestly have no idea what did to make this happen guess now am stuck with it for more months luckily none of functionality seems lost so am still enjoying it for most part
2 decided time price were right to upgrade from my galaxy s5 which had for about three years to s8 overall am pleased with it compared to s5 s8 is thinner narrower made of slick shiny guerrilla glass this might

## 7. Conduct Sentiment Analysis
- For each feature of a smartphone, our group conducted sentiment analysis to get customers’ opinions. 
- One method we used was VADER which is an unsupervised method and based on lexicons of sentiment-related words. 
- The other way was to analyze reviews by using the positive and negative dictionary. 

### VADER (unsupervised training):
- Get sentiment of every customer 's review of each feature
- polarity_scores will return a dictionary contains postive, neutural, and negative score of each input sentence
- Because the neutral score is always the highest among three scores, we only consider negative and postive scores
- If positive score is greater than negative score, pos is incremented by 1 and vice versa.
- The function also will return the ratio of pos

In [46]:
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    sid = SentimentIntensityAnalyzer()
    
    def get_sentiment(feature_reviews):
        pos = 0
        neg = 0
        for feature_review in feature_reviews:
            ss = sid.polarity_scores(feature_review)
            if (ss['neg'] > ss['pos']):
                neg += 1
            else:
                pos += 1
        print ("The number of positive reviews: ", pos, "\nThe number of negative reviews: ", neg)
        print ("The positive review ratio: ", pos / (pos + neg))
    
    if __name__ == "__main__":
        print ("Note 8 features: ")
        print ("Battery")
        get_sentiment(Note8_battery)
        print ("Camera")
        get_sentiment(Note8_camera)
        print ("Processor")
        get_sentiment(Note8_processor)
        print ("Screen")
        get_sentiment(Note8_screen)
        
        print ("\nS8 features: ")
        print ("Battery")
        get_sentiment(S8_battery)
        print ("Camera")
        get_sentiment(S8_camera)
        print ("Processor")
        get_sentiment(S8_processor)
        print ("Screen")
        get_sentiment(S8_screen)
        
        print ("\niPhone 8 features: ")
        print ("Battery")
        get_sentiment(iPhone8_battery)
        print ("Camera")
        get_sentiment(iPhone8_camera)
        print ("Processor")
        get_sentiment(iPhone8_processor)
        print ("Screen")
        get_sentiment(iPhone8_screen)

Note 8 features: 
Battery
The number of positive reviews:  167 
The number of negative reviews:  11
The positive review ratio:  0.9382022471910112
Camera
The number of positive reviews:  113 
The number of negative reviews:  21
The positive review ratio:  0.8432835820895522
Processor
The number of positive reviews:  33 
The number of negative reviews:  7
The positive review ratio:  0.825
Screen
The number of positive reviews:  123 
The number of negative reviews:  3
The positive review ratio:  0.9761904761904762

S8 features: 
Battery
The number of positive reviews:  323 
The number of negative reviews:  20
The positive review ratio:  0.9416909620991254
Camera
The number of positive reviews:  254 
The number of negative reviews:  30
The positive review ratio:  0.8943661971830986
Processor
The number of positive reviews:  114 
The number of negative reviews:  5
The positive review ratio:  0.957983193277311
Screen
The number of positive reviews:  269 
The number of negative reviews:  14


### Using positive and negative dictionary
- Count positve and negative words of reviews for given feature of a smartphone

In [69]:
import nltk
from nltk.corpus import stopwords
import csv

def sentiment_analysis(text):
    tokens=tokenize(text)
    vocabulary= set(tokens)
    dictionary={word: tokens.count(word) for word in vocabulary}
    stop_words = stopwords.words('english')
    filtered_dictionary={word: dictionary[word] for word in dictionary if word not in stop_words}

    with open("positive-words.txt",'r') as f:
        positive_words=[line.strip() for line in f]

        positive_tokens=[token for token in tokens if token in positive_words]

        pln =  len(positive_tokens)
        print ("the number of positive words: ", pln)

    with open("negative-words.txt",'r') as f:
        negative_words=[line.strip() for line in f]

        negative_tokens=[token for token in tokens if token in negative_words]
        nln = len(negative_tokens)
        print ("the number of negative words: ", nln)
        rate = pln*1.0/(pln+nln)
        print ("the ratio of positive words: ",rate)

    sentiment=""
    if len(negative_tokens) > len(positive_tokens):
        sentiment="negative"
    elif len(negative_tokens)<len(positive_tokens):
        sentiment="positive"
    else:
        sentiment="neutral"
    return sentiment

def tokenize(text):

    text=text.lower()
    pattern=r'[a-z]+[\-\.]*[a-z]*[\-\.]*' 
    tokens=nltk.regexp_tokenize(text, pattern)

    return tokens

if __name__ == "__main__":
    
    print ("Note 8 features: ")
    print ("Battery")
    text= " ".join(Note8_battery)
    ss = sentiment_analysis(text)
    print (ss)
    print ("Camera")
    text= " ".join(Note8_camera)
    ss = sentiment_analysis(text)
    print (ss)
    print ("Processor")
    text= " ".join(Note8_processor)
    ss = sentiment_analysis(text)
    print (ss)
    print ("Screen")
    text= " ".join(Note8_screen)
    ss = sentiment_analysis(text)
    print (ss)

    print ("\nS8 features:")
    print ("Battery")
    text= " ".join(S8_battery)
    ss = sentiment_analysis(text)
    print (ss)
    print ("Camera")
    text= " ".join(S8_camera)
    ss = sentiment_analysis(text)
    print (ss)
    print ("Processor")
    text= " ".join(S8_processor)
    ss = sentiment_analysis(text)
    print (ss)
    print ("Screen")
    text= " ".join(S8_screen)
    ss = sentiment_analysis(text)
    print (ss)
    
    print ("\niPhone 8 features:")
    print ("Battery")
    text= " ".join(iPhone8_battery)
    ss = sentiment_analysis(text)
    print (ss)
    print ("Camera")
    text= " ".join(iPhone8_camera)
    ss = sentiment_analysis(text)
    print (ss)
    print ("Processor")
    text= " ".join(iPhone8_processor)
    ss = sentiment_analysis(text)
    print (ss)
    print ("Screen")
    text= " ".join(iPhone8_screen)
    ss = sentiment_analysis(text)
    print (ss)
    

Note 8 features: 
Battery
the number of positive words:  619
the number of negative words:  127
the ratio of positive words:  0.8297587131367292
positive
Camera
the number of positive words:  424
the number of negative words:  127
the ratio of positive words:  0.7695099818511797
positive
Processor
the number of positive words:  195
the number of negative words:  56
the ratio of positive words:  0.7768924302788844
positive
Screen
the number of positive words:  358
the number of negative words:  44
the ratio of positive words:  0.8905472636815921
positive

S8 features:
Battery
the number of positive words:  1426
the number of negative words:  295
the ratio of positive words:  0.8285880302149913
positive
Camera
the number of positive words:  999
the number of negative words:  263
the ratio of positive words:  0.7916006339144216
positive
Processor
the number of positive words:  302
the number of negative words:  104
the ratio of positive words:  0.7438423645320197
positive
Screen
the numbe