# Project Part 1

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/sgeinitz/CS39AA-project/blob/main/project_part1.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sgeinitz/CS39AA-project/blob/main/project_part1.ipynb)

## 1. Introduction/Background

_In this section you will describe (in English) the dataset you are using as well as the NLP problem it deals with. For example, if you are planning to use the Twitter Natural Disaster dataset, then you will describe what the data is and where it came from, as if you were explaining it to someone who does not know anything about the data. You will then describe how this is a __text classification__ problem, and that the labels are binary (e.g. a tweet either refers to a genuine/real natural disaster, or it does not)._

_Overall, this should be about a paragraph of text that could be read by someone outside of our class, and they could still understand what it is your project is doing._ 

_Note that you should __not__ simply write one sentence stating, "This project is based on the Kaggle competition: Predicting Natural Disasters with Twitter."_

_If you instead are planning to do a more research-oriented or applied type of project, then describe what it is that you plan to do._

_If it is research, then what do you want to understand/explain better?_


## Introduction

The purpose of this project is to explore how different hyperparameters affect the accuracy of their respective models. I will start with a random forest and then either compare it side by side with other models or progress to them. Note that this is a deeper analysis of internal hyperparameters, which may or may not extend to hidden layers, dropout rates, etc., and it is purely informational. The results are not meant to be used as a final model; they are intended as a potential starting point when building machine learning or deep learning models.

The dataset chosen for this project is the "Starbucks Reviews Dataset", published by Harshal H on Kaggle at https://www.kaggle.com/datasets/harshalhonde/starbucks-reviews-dataset.
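
The kind of hyperparameter comparison described above can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data from scikit-learn in place of the review text, and the choice of `n_estimators` and its candidate values are hypothetical, not the settings this project will necessarily use.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; the real project uses review text).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hypothetical candidate values for a single hyperparameter, n_estimators.
results = {}
for n_estimators in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    # Mean accuracy over 5 cross-validation folds for this setting.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[n_estimators] = scores.mean()

for n_estimators, accuracy in results.items():
    print(f"n_estimators={n_estimators}: mean accuracy={accuracy:.3f}")
```

Holding `random_state` fixed keeps the comparison focused on the hyperparameter being varied rather than on run-to-run randomness.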

## 2. Exploratory Data Analysis

_You will now load the dataset and carry out some exploratory data analysis steps to better understand what text data looks like. See the examples from class on 10/. The following links provide some good resources of exploratory analyses of text data with Python._


* https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools
* https://regenerativetoday.com/exploratory-data-analysis-of-text-data-including-visualization-and-sentiment-analysis/
* https://medium.com/swlh/text-summarization-guide-exploratory-data-analysis-on-text-data-4e22ce2dd6ad  
* https://www.kdnuggets.com/2019/05/complete-exploratory-data-analysis-visualization-text-data.html  




In [1]:
import pandas as pd
import numpy as np

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# ...

/kaggle/input/starbucks-reviews-dataset/reviews_data.csv


First, we will open the dataset to take a peek at its contents. The code above gives us the file path to the dataset, and the code below prints its first few entries.

In [2]:
input_data_path = '/kaggle/input/starbucks-reviews-dataset/'
training_data_file = 'reviews_data.csv'
df = pd.read_csv(input_data_path + training_data_file)
df.head()

|   | name | location | Date | Rating | Review | Image_Links |
|---|------|----------|------|--------|--------|-------------|
| 0 | Helen | Wichita Falls, TX | Reviewed Sept. 13, 2023 | 5.0 | Amber and LaDonna at the Starbucks on Southwes... | ['No Images'] |
| 1 | Courtney | Apopka, FL | Reviewed July 16, 2023 | 5.0 | ** at the Starbucks by the fire station on 436... | ['No Images'] |
| 2 | Daynelle | Cranberry Twp, PA | Reviewed July 5, 2023 | 5.0 | I just wanted to go out of my way to recognize... | ['https://media.consumeraffairs.com/files/cach... |
| 3 | Taylor | Seattle, WA | Reviewed May 26, 2023 | 5.0 | Me and my friend were at Starbucks and my card... | ['No Images'] |
| 4 | Tenessa | Gresham, OR | Reviewed Jan. 22, 2023 | 5.0 | I’m on this kick of drinking 5 cups of warm wa... | ['https://media.consumeraffairs.com/files/cach... |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         850 non-null    object 
 1   location     850 non-null    object 
 2   Date         850 non-null    object 
 3   Rating       705 non-null    float64
 4   Review       850 non-null    object 
 5   Image_Links  850 non-null    object 
dtypes: float64(1), object(5)
memory usage: 40.0+ KB


In [4]:
df.shape

(850, 6)

In [5]:
df.columns.tolist()

['name', 'location', 'Date', 'Rating', 'Review', 'Image_Links']



The first peek into the dataset shows us quite a bit. First, the data consists of 6 columns, as listed above. For the purposes of this model, the only columns that will be used are 'Rating' and 'Review', as they are the two that carry the relevant data. 'name', 'location', and 'Image_Links' are irrelevant to what we want to look at. 'Date' may yield something useful (perhaps reviews trended positive or negative depending on the year), but it is also not needed, as that is beyond the scope of this particular project.
Second, there are 850 entries in the dataset, of which only 705 have a non-null value in the 'Rating' column. The 145 rows with a null rating will be removed to reduce the chance of introducing bad data that could degrade the model's predictions.
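
The cleanup described above could look something like the following. This is a minimal sketch that runs on a small made-up frame (`sample`, a hypothetical stand-in that only reuses the dataset's column names), so it does not depend on the Kaggle file being present.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame that mirrors the dataset's column names.
sample = pd.DataFrame({
    "name": ["A", "B", "C"],
    "location": ["X", "Y", "Z"],
    "Date": ["d1", "d2", "d3"],
    "Rating": [5.0, np.nan, 1.0],
    "Review": ["Great service", "No rating given", "Cold coffee"],
    "Image_Links": ["['No Images']"] * 3,
})

# Keep only the two columns of interest, then drop rows with no rating.
cleaned = (
    sample[["Rating", "Review"]]
    .dropna(subset=["Rating"])
    .reset_index(drop=True)
)

print(cleaned.shape)
```

Applied to `df` instead of `sample`, the same chain would be expected to leave 705 rows, matching the non-null count that `df.info()` reports for 'Rating'.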