# Project 4: Yelp Sentiment Analysis

## Dataset: 

Yelp Reviews Dataset (available from Yelp Dataset Challenge or Kaggle)

## Analysis Goals:

1. Data Understanding:
    - Explore the structure of the dataset, including features like review text, star ratings, and business categories.
    - Understand the distribution of star ratings and sentiment labels (e.g., positive, negative) in the dataset.
      
2. Text Preprocessing:
    - Tokenization: Split the reviews into individual words or tokens.
    - Removing stopwords: Eliminate common words that carry little sentiment information.
    - Stemming/Lemmatization: Reduce words to their root form to shrink the vocabulary.
    - Handle URLs, mentions, and hashtags: Remove them or replace them with placeholders where they occur in review text.

3. Feature Extraction:
    - Bag-of-Words (BoW): Represent text data as a matrix of word frequencies.
    - TF-IDF (Term Frequency-Inverse Document Frequency): Weight each word by its frequency in a document, discounted by how common it is across the corpus.
    - Word Embeddings: Represent words as dense vectors in a continuous space, e.g., with Word2Vec or GloVe.

4. Model Building:
    - Algorithms: Use algorithms suited to text classification:
        - Naive Bayes
        - Logistic Regression
        - Support Vector Machines
        - Neural network-based models such as LSTM or CNN
    - Train/Test Split: Split the data into training and testing sets, typically 80/20 or 70/30.
    - Model Evaluation:
        - Accuracy
        - Precision
        - Recall
        - F1-score
        - ROC-AUC

5. Hyperparameter Tuning:
    - Use techniques like GridSearchCV or RandomizedSearchCV to find optimal hyperparameters for each model.
    - Use k-fold cross-validation during the search so that parameter selection does not depend on a single split.

6. Model Evaluation:
    - Evaluate the trained models using metrics such as accuracy, precision, recall, and F1-score.
    - Use techniques like k-fold cross-validation to assess the generalization performance of the models.
    - Visualize model performance using confusion matrices and ROC curves.

7. Analysis of Results:
    - Examine misclassified examples: Analyze misclassified reviews to understand common errors made by the model.
    - Interpret feature importance: If using models like Logistic Regression or linear SVM, analyze feature coefficients or weights to understand which words contribute most to sentiment classification.
    - Optionally, analyze sentiment trends across different business categories or locations within the dataset.
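
Steps 3 to 6 above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the analysis itself: the corpus is a made-up toy stand-in for the review text, Logistic Regression is just one of the listed algorithms, and the parameter grid and the F1 scoring choice are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (made up for illustration, NOT taken from the Yelp data).
positive = ["great food and friendly staff", "loved the cozy atmosphere",
            "delicious pizza, will come back", "amazing service and tasty dishes",
            "best coffee in town", "wonderful experience, highly recommend"]
negative = ["terrible food and rude staff", "hated the long wait",
            "cold pizza, never coming back", "awful service and bland dishes",
            "worst coffee in town", "horrible experience, do not recommend"]
# Repeat the toy texts so every cross-validation fold contains both classes.
texts = (positive + negative) * 5
labels = ([1] * len(positive) + [0] * len(negative)) * 5

# 80/20 train/test split, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# TF-IDF features feeding a linear classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: unigrams vs. unigrams+bigrams, and three regularization strengths.
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Evaluate the best pipeline on the held-out split.
y_pred = search.predict(X_test)
print(search.best_params_)
print(classification_report(y_test, y_pred))
```

Because the toy texts are repeated, the held-out split overlaps with the training data, so the printed scores say nothing about real performance. On the actual reviews, `texts` and `labels` would come from the `text` column and a sentiment label derived from `stars`.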

## Analysis

### Load Data

| Field Name  | Description                                                                          |
|-------------|--------------------------------------------------------------------------------------|
| review_id   | 22-character unique review id                                                        |
| user_id     | 22-character unique user id, maps to the user in user.json                           |
| business_id | 22-character business id, maps to the business in business.json                      |
| stars       | Star rating (integer)                                                                |
| date        | Date formatted YYYY-MM-DD (the loaded CSV also carries a time, e.g. 2018-07-07 22:09:11) |
| text        | The review itself (string)                                                           |
| useful      | Number of useful votes received (integer)                                            |
| funny       | Number of funny votes received (integer)                                             |
| cool        | Number of cool votes received (integer)                                              |


In [36]:
import pandas as pd

# Load only the first 100,000 rows of the review data
df = pd.read_csv('data/Yelp_Review_Data.csv', nrows=100000)
df.head()

| Unnamed: 0.1 | Unnamed: 0 | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|--------------|------------|-----------|---------|-------------|-------|--------|-------|------|------|------|
| 0 | 0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | 1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
| 2 | 2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | 3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |
| 4 | 4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |
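
The first analysis goal asks for the distribution of star ratings and sentiment labels. The data has no sentiment column, so one has to be derived; a minimal sketch follows. The labelling rule (1-2 stars negative, 3 neutral, 4-5 positive) and the helper name `stars_to_sentiment` are assumptions for illustration, and the small frame stands in for the loaded `df`.

```python
import pandas as pd

# Small stand-in frame with a `stars` column like the review data
# (values are illustrative; in the notebook this would be the loaded `df`).
sample = pd.DataFrame({"stars": [3, 5, 3, 5, 4, 1, 2]})

def stars_to_sentiment(stars: int) -> str:
    """Hypothetical labelling rule: 1-2 negative, 3 neutral, 4-5 positive."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

sample["sentiment"] = sample["stars"].apply(stars_to_sentiment)

# Count of reviews per star rating, ordered from 1 to 5.
star_counts = sample["stars"].value_counts().sort_index()
# Share of each sentiment label, useful for spotting class imbalance.
sentiment_share = sample["sentiment"].value_counts(normalize=True)

print(star_counts)
print(sentiment_share)
```

Whether to keep 3-star reviews as a neutral class or drop them for a binary task is a design choice the outline leaves open.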


### Data Preprocessing

In [37]:
import numpy as np

# Drop columns we don't need
df = df.drop(['Unnamed: 0', 'review_id', 'user_id', 'business_id'], axis=1)
df.head()

| Unnamed: 0 | stars | useful | funny | cool | text | date |
|------------|-------|--------|-------|------|------|------|
| 0 | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
| 2 | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |
| 4 | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |
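
The text preprocessing steps from the analysis goals (lowercasing, URL handling, tokenization, stopword removal) could be applied to the `text` column roughly as follows. This is a standard-library sketch under assumptions: the stopword set is a tiny illustrative sample, `urltoken` is a hypothetical placeholder, and stemming/lemmatization is left out because it needs an external library.

```python
import re
from typing import List

# Tiny illustrative stopword set (an assumption; a real run would likely use
# a fuller list such as the ones shipped with NLTK or scikit-learn).
STOPWORDS = {"a", "an", "the", "and", "or", "is", "it", "to", "of", "in",
             "at", "was", "had"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Lowercase words, optionally with one internal apostrophe (e.g. "i've").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def preprocess(text: str) -> List[str]:
    """Lowercase, replace URLs with a placeholder, tokenize, drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" urltoken ", text)  # hypothetical placeholder token
    tokens = TOKEN_RE.findall(text)
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("I've taken a lot of spin classes"))
print(preprocess("Menu at http://example.com is great"))
```

In the notebook this would typically be applied with `df['text'].apply(preprocess)`.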
