# Lab 8: Define and Solve an ML Problem of Your Choosing

In [3]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind, formulate a project plan, and implement that plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [4]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


dfCensus = pd.read_csv(adultDataSet_filename, header = 0)
dfCensus.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income_binary
0,39.0,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Non-Female,2174,0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Non-Female,0,0,13.0,United-States,<=50K
2,38.0,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Non-Female,0,0,40.0,United-States,<=50K
3,53.0,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Non-Female,0,0,40.0,United-States,<=50K
4,28.0,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.0,Cuba,<=50K


In [5]:
df = pd.read_csv(airbnbDataSet_filename, header = 0)
print(df.columns)
# Idea: predict whether the review is a 4 or a 5 based on the description?
# List the columns that hold text (object dtype)
df.columns[df.dtypes == "object"]

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'neighbourhood_group_cleansed',
       'room_type', 'amenities'],
      dtype='object')

In [6]:
dfWorldHappiness = pd.read_csv(WHRDataSet_filename, header = 0)
dfWorldHappiness.head()

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,,,
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,,,0.34454


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

I have chosen the Airbnb NYC listings data set. The label is derived from `review_scores_rating`: ratings greater than or equal to 4 are "good" reviews, and ratings below 4 are "bad" reviews. This is a supervised learning problem, specifically binary classification.

I plan to use a combination of NLP and neural networks to predict the label from several of the available features, such as the description, the number of bathrooms and bedrooms, the neighborhood overview, and the price.

A company could create value by examining which features have the greatest effect on the rating. Hosts could then raise their ratings by adding certain features and changing others.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
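The inspection techniques named above can be tried on a small example first. The sketch below uses a hypothetical toy DataFrame (`toy`, with made-up values) rather than the real listings file; the column names are borrowed from the Airbnb data set only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real listings DataFrame.
toy = pd.DataFrame({
    "price": [80.0, 120.0, np.nan, 95.0, 2500.0, 110.0],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Shared room", "Entire home/apt", None],
    "review_scores_rating": [4.8, 4.9, 3.5, 4.6, 5.0, 4.7],
})

# Data type of each column.
print(toy.dtypes)

# Key statistics for numeric columns; a max far above the 75% quantile hints at outliers.
print(toy.describe())

# Number of missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Class balance for a binary label derived from the rating.
label = toy["review_scores_rating"] >= 4
print(label.value_counts(normalize=True))

# Share of rows in the positive class.
good_share = label.mean()
print(good_share)
```

A positive-class share close to 0 or 1 is a sign of class imbalance, which affects how accuracy should be interpreted later.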

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [7]:
# Remove every review score column except review_scores_rating.
# A later cell derives the binary label "Good AirBnB" from review_scores_rating.
review_col_names = [col for col in df.columns
                    if col.startswith('review_scores')]
review_col_names.remove("review_scores_rating")
df = df.drop(columns = review_col_names)

In [8]:
# Check whether review_scores_rating contains any missing values
has_missing_ratings = df["review_scores_rating"].isnull().any()
print(has_missing_ratings)


False


In [9]:
# Add the "Good AirBnB" label: True when review_scores_rating is at least 4.
# Then drop review_scores_rating so the label does not leak into the features.
df["Good AirBnB"] = df["review_scores_rating"] >= 4
df = df.drop(columns = "review_scores_rating")

# Convert every boolean column (including the new label) to float64
bool_cols = df.select_dtypes(include='bool').columns
df = df.astype({c: "float64" for c in bool_cols})

In [10]:
# Remove features that are unlikely to be relevant: host_name, host_about,
# and host_location
df = df.drop(columns = ["host_name", "host_about", "host_location"])
df.head()

Unnamed: 0,name,description,neighborhood_overview,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,...,number_of_reviews_ltm,number_of_reviews_l30d,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications,Good AirBnB
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,0.8,0.17,1.0,8.0,8.0,1.0,1.0,...,0,0,0.0,3,3,0,0,0.33,9,1.0
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,0.09,0.69,1.0,1.0,1.0,1.0,1.0,...,32,0,0.0,1,1,0,0,4.86,6,1.0
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,1.0,0.25,1.0,1.0,1.0,1.0,1.0,...,1,0,0.0,1,1,0,0,0.02,3,1.0
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,33,2,0.0,1,0,1,0,3.68,4,1.0
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,,,1.0,1.0,1.0,1.0,1.0,...,0,0,0.0,1,0,1,0,0.87,7,1.0


In [11]:
# One-hot encode room_type and neighbourhood_group_cleansed
df_room_type = pd.get_dummies(df['room_type'], prefix='room_type_')
df = df.join(df_room_type)
df.drop(columns = 'room_type', inplace=True)

df_neighbourhood_cleansed = pd.get_dummies(df['neighbourhood_group_cleansed'], prefix = "neighborhood_group_cleansed_")
df = df.join(df_neighbourhood_cleansed)
df.drop(columns = 'neighbourhood_group_cleansed', inplace=True)

df.head()

Unnamed: 0,name,description,neighborhood_overview,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,...,Good AirBnB,room_type__Entire home/apt,room_type__Hotel room,room_type__Private room,room_type__Shared room,neighborhood_group_cleansed__Bronx,neighborhood_group_cleansed__Brooklyn,neighborhood_group_cleansed__Manhattan,neighborhood_group_cleansed__Queens,neighborhood_group_cleansed__Staten Island
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,0.8,0.17,1.0,8.0,8.0,1.0,1.0,...,1.0,1,0,0,0,0,0,1,0,0
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,0.09,0.69,1.0,1.0,1.0,1.0,1.0,...,1.0,1,0,0,0,0,1,0,0,0
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,1.0,0.25,1.0,1.0,1.0,1.0,1.0,...,1.0,1,0,0,0,0,1,0,0,0
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,0,0,1,0,0,0,1,0,0
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,,,1.0,1.0,1.0,1.0,1.0,...,1.0,0,0,1,0,0,0,1,0,0


In [12]:
# Deal with missing values
nan_count = df.isnull().sum()
nan_count_not_zero = nan_count[nan_count > 0]

# For text columns, fill missing values with ""
# For all other columns, fill missing values with 0
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
for col in nan_count_not_zero.index:
    if df[col].dtype == "object":
        df[col] = df[col].fillna("")
    else:
        df[col] = df[col].fillna(0)

# Recount to confirm that no missing values remain
nan_count = df.isnull().sum()

In [13]:
# Identify columns with text data
to_encode = list(df.columns[df.dtypes == "object"])
to_encode

['name', 'description', 'neighborhood_overview', 'amenities']

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

My plan is to split the data into text features and numerical features, train a separate model on each, and compare their effectiveness. I will use a neural network for the numerical features, and TF-IDF vectorization followed by logistic regression for the text features.

I removed a few features that I felt were not relevant to the prediction (`host_name`, `host_about`, and `host_location`). I also converted the boolean values into numeric features.

I will build each model, compute its accuracy on held-out test data, and then adjust hyperparameters to improve that accuracy.
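One way the "change hyperparameters and compare" step could look for the text features is a cross-validated grid search. This is only a sketch under assumptions: the toy corpus, the labels, and the grid of `C` values are hypothetical, and the notebook itself does not use `GridSearchCV` or a `Pipeline`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = "good" listing, 0 = "bad" listing.
texts = [
    "clean quiet room near subway", "spacious clean apartment great view",
    "bright clean studio friendly host", "quiet cozy room clean bathroom",
    "great location clean and bright", "friendly host spacious quiet flat",
    "dirty noisy room broken heater", "noisy street dirty bathroom",
    "broken lock rude host", "rude host noisy neighbors",
    "dirty kitchen broken window", "small dirty noisy basement",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is refit on each
# training fold, so no vocabulary or IDF statistics leak from validation data.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])

# Assumed grid of regularization strengths to compare.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(texts, labels)

print("Best C:", search.best_params_["clf__C"])
print("Mean cross-validated AUC: {:.4f}".format(search.best_score_))
```

Scoring by AUC rather than accuracy is a design choice here: with a heavily imbalanced label, accuracy can look high even when the model never predicts the minority class.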

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
import tensorflow.keras as keras
import time
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import re



<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [15]:
# Separate the data into text features and numeric features
# (both DataFrames keep the "Good AirBnB" label column)
df_descr = df[to_encode + ["Good AirBnB"]]
df_numeric = df.drop(columns = to_encode)
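The cells that follow use `X_train_n`, `y_train_n`, `X_test_n`, and `y_test_n`, but the cell that creates them (In [16]) does not appear in this export. The sketch below shows one plausible way to produce them; the helper name `split_and_scale`, the 25% test size, the stratified split, and the use of `StandardScaler` are all assumptions, not the original cell. It runs on a hypothetical toy DataFrame; in the notebook, `df_numeric` would be passed instead.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_and_scale(df_numeric, label_col="Good AirBnB",
                    test_size=0.25, random_state=1234):
    """Split numeric features into train/test sets and standardize them."""
    y = df_numeric[label_col]
    X = df_numeric.drop(columns=label_col)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y)
    # Fit the scaler on the training data only so that no test-set
    # statistics leak into the features the model is trained on.
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test


# Hypothetical stand-in for df_numeric (made-up values).
toy_numeric = pd.DataFrame({
    "price": [80.0, 120.0, 95.0, 200.0, 60.0, 150.0, 110.0, 90.0],
    "bedrooms": [1.0, 2.0, 1.0, 3.0, 1.0, 2.0, 2.0, 1.0],
    "Good AirBnB": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
})

X_train_n, X_test_n, y_train_n, y_test_n = split_and_scale(toy_numeric)
print(X_train_n.shape, X_test_n.shape)
```

Scaling matters for the neural network in particular, because features such as price and bedroom counts sit on very different ranges.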

In [17]:
# Neural network to process the numeric features

# 1. Create model object:
nn_model = keras.Sequential()


# 2. Create the input layer and add it to the model object:
input_layer = keras.layers.InputLayer(input_shape = (X_train_n.shape[1],), name = "input")
nn_model.add(input_layer)

# 3. Create the first hidden layer and add it to the model object:
hidden_layer_1 = keras.layers.Dense(units = 64, activation = "relu", name = "L1")
nn_model.add(hidden_layer_1)

# 4. Create the second hidden layer and add it to the model object:
hidden_layer_2 = keras.layers.Dense(units = 32, activation = "relu", name = "L2")
nn_model.add(hidden_layer_2)

# 5. Create the third hidden layer and add it to the model object:
hidden_layer_3 = keras.layers.Dense(units = 16, activation = "relu", name = "L3")
nn_model.add(hidden_layer_3)

# 6. Create the output layer and add it to the model object:
# A single sigmoid unit outputs the probability of the positive class.
output_layer = keras.layers.Dense(units = 1, activation = "sigmoid", name = "output")
nn_model.add(output_layer)


# Print summary of neural network model structure
nn_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
L1 (Dense)                   (None, 64)                2880      
_________________________________________________________________
L2 (Dense)                   (None, 32)                2080      
_________________________________________________________________
L3 (Dense)                   (None, 16)                528       
_________________________________________________________________
output (Dense)               (None, 1)                 17        
Total params: 5,505
Trainable params: 5,505
Non-trainable params: 0
_________________________________________________________________




In [18]:
sgd_optimizer = keras.optimizers.SGD(learning_rate = 0.9)
loss_fn = keras.losses.BinaryCrossentropy(from_logits = False)
nn_model.compile(optimizer = sgd_optimizer, loss = loss_fn, metrics = ["accuracy"])

In [19]:
class ProgBarLoggerNEpochs(keras.callbacks.Callback):
    
    def __init__(self, num_epochs: int, every_n: int = 50):
        self.num_epochs = num_epochs
        self.every_n = every_n
    
    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.every_n == 0:
            s = 'Epoch [{}/ {}]'.format(epoch + 1, self.num_epochs)
            logs_s = ['{}: {:.4f}'.format(k.capitalize(), v)
                      for k, v in logs.items()]
            s_list = [s] + logs_s
            print(', '.join(s_list))

In [20]:
t0 = time.time() # start time

num_epochs = 100 # epochs

history = nn_model.fit(X_train_n,
                       y_train_n,
                       epochs = num_epochs,
                       verbose = 0,
                       callbacks = [ProgBarLoggerNEpochs(num_epochs, every_n = 5)],
                       validation_split = 0.2)


t1 = time.time() # stop time

print('Elapsed time: %.2fs' % (t1-t0))



Epoch [5/ 100], Loss: 0.1748, Accuracy: 0.9579, Val_loss: 0.1681, Val_accuracy: 0.9600
Epoch [10/ 100], Loss: 0.1750, Accuracy: 0.9579, Val_loss: 0.1678, Val_accuracy: 0.9600
Epoch [15/ 100], Loss: 0.1750, Accuracy: 0.9579, Val_loss: 0.1681, Val_accuracy: 0.9600
Epoch [20/ 100], Loss: 0.1750, Accuracy: 0.9579, Val_loss: 0.1684, Val_accuracy: 0.9600
Epoch [25/ 100], Loss: 0.1749, Accuracy: 0.9579, Val_loss: 0.1681, Val_accuracy: 0.9600
Epoch [30/ 100], Loss: 0.1749, Accuracy: 0.9579, Val_loss: 0.1681, Val_accuracy: 0.9600
Epoch [35/ 100], Loss: 0.1749, Accuracy: 0.9579, Val_loss: 0.1679, Val_accuracy: 0.9600
Epoch [40/ 100], Loss: 0.1748, Accuracy: 0.9579, Val_loss: 0.1678, Val_accuracy: 0.9600
Epoch [45/ 100], Loss: 0.1750, Accuracy: 0.9579, Val_loss: 0.1679, Val_accuracy: 0.9600
Epoch [50/ 100], Loss: 0.1748, Accuracy: 0.9579, Val_loss: 0.1687, Val_accuracy: 0.9600
Epoch [55/ 100], Loss: 0.1748, Accuracy: 0.9579, Val_loss: 0.1683, Val_accuracy: 0.9600
Epoch [60/ 100], Loss: 0.1750, Ac

In [21]:
loss, accuracy = nn_model.evaluate(X_test_n, y_test_n)

print('Loss: {0} Accuracy: {1}'.format(loss, accuracy))

Loss: 0.18307356536388397 Accuracy: 0.9551813006401062


In [22]:
# Make predictions on the test set
probability_predictions = nn_model.predict(X_test_n)

# Convert probabilities to class labels using a 0.6 threshold
class_label_predictions = (probability_predictions.ravel() >= 0.6).astype(int)

print('Confusion Matrix for the model: ')
# labels=[1, 0] puts the positive class (good AirBnB) in the first row and column,
# so the row and column names below must list "Good" first
c_m = confusion_matrix(y_test_n, class_label_predictions, labels=[1, 0])
# Create a Pandas DataFrame out of the confusion matrix for display purposes
pd.DataFrame(
    c_m,
    columns=['Predicted: Good AirBnB', 'Predicted: Not good AirBnB'],
    index=['Actual: Good AirBnB', 'Actual: Not good AirBnB']
)

Confusion Matrix for the model: 


Unnamed: 0,Predicted: Good AirBnB,Predicted: Not good AirBnB
Actual: Good AirBnB,6692,0
Actual: Not good AirBnB,314,0
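When one class dominates, overall accuracy can hide a model that assigns the same class to every row. Per-class recall and balanced accuracy make this visible. The sketch below uses hypothetical labels (19 positives, 1 negative) and a hypothetical model that predicts the positive class everywhere; it is an illustration, not output from this notebook.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Hypothetical ground truth: 19 good listings (1) and 1 bad listing (0).
y_true = np.array([1] * 19 + [0] * 1)
# Hypothetical model output: "good" for every listing.
y_pred = np.ones_like(y_true)

# Plain accuracy looks strong because the majority class dominates.
accuracy = (y_true == y_pred).mean()

# Recall computed separately for each class.
recall_good = recall_score(y_true, y_pred, pos_label=1)
recall_bad = recall_score(y_true, y_pred, pos_label=0)

# Balanced accuracy is the mean of the per-class recalls.
balanced = balanced_accuracy_score(y_true, y_pred)

print("Accuracy:", accuracy)
print("Recall (good):", recall_good)
print("Recall (bad):", recall_bad)
print("Balanced accuracy:", balanced)
```

Here accuracy is 0.95 while recall on the minority class is 0, which is why a second metric is worth reporting alongside accuracy.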


In [33]:
# TF-IDF with logistic regression for the text features

def logistic_tfidf(X, y, col_name):
    print("column: ", col_name)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.6, random_state=1234)

    # 1. Create a TfidfVectorizer object
    tfidf_vectorizer = TfidfVectorizer()
    
    # 2. Fit the vectorizer to the training data only
    tfidf_vectorizer.fit(X_train)
          
    # 3. Transform *both* the training and test data using the fitted vectorizer's transform() method
    X_train_tfidf = tfidf_vectorizer.transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    
    # 4. Create a LogisticRegression model object and fit it to the transformed training data
    model = LogisticRegression(max_iter=200)
    model.fit(X_train_tfidf, y_train)
    
    # 5. Predict positive-class probabilities on the transformed test data
    # (second column of predict_proba()); not used below, but available for roc_auc_score()
    probability_predictions = model.predict_proba(X_test_tfidf)[:,1]
    
    # 6. Predict class labels on the transformed test data using the predict() method
    class_label_predictions = model.predict(X_test_tfidf)
    
    # 7. Compute the accuracy on the test data
    acc_score = accuracy_score(y_test, class_label_predictions)
    print('Accuracy score on the test data: {:.4f}'.format(acc_score))

In [34]:
for col in to_encode:
    logistic_tfidf(df[col], df["Good AirBnB"], col)

column:  name
Accuracy score on the test data: 0.9554
column:  description
Accuracy score on the test data: 0.9554
column:  neighborhood_overview
Accuracy score on the test data: 0.9554
column:  amenities
Accuracy score on the test data: 0.9554
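Identical scores across four different text columns suggest comparing against a majority-class baseline, since a model that always predicts the most common label would also score the same on every column. The sketch below shows that comparison with scikit-learn's `DummyClassifier`; the labels are hypothetical (95% positive), not taken from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95 positives and 5 negatives.
y = np.array([1.0] * 95 + [0.0] * 5)
# DummyClassifier ignores the feature values, so a placeholder column is enough.
X_placeholder = np.zeros((len(y), 1))

# Baseline that always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y)

baseline_accuracy = accuracy_score(y, baseline.predict(X_placeholder))
print("Majority-class baseline accuracy: {:.4f}".format(baseline_accuracy))
```

If a trained model only matches this number, it has likely learned nothing beyond the class prior; passing `class_weight="balanced"` to `LogisticRegression` is one option to try in that case.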


I found that results varied for both methods. For the neural network, sometimes the results were very accurate, and sometimes they were not. I adjusted the learning rate to be about 0.75, which improved results. For TF-IDF, every encoded co

For further exploration of the data, I would add hidden layers to the neural network and change some of the layer sizes to improve on the accuracy.