In [None]:
#Run the following code to print multiple outputs from a cell
get_ipython().ast_node_interactivity = 'all'

# Neural Networks

## Importing & Setting up the Data
Import the file, "creditCardDefaultReduced.csv", and save it in a variable called `df`. 

In [None]:
import pandas as pd
df = pd.read_csv("creditCardDefaultReduced.csv")
df

### Outcome Variable
Next, create your `outcome` variable:

In [None]:
outcome = df["Payment"]

### Features Variable
For the `features` variable, first save the numeric features:

In [None]:
numericFeatures = df[["Limit_Bal", "Bill_Amt1", "Pay_Amt1", "Age"]]

Next, create dummy variables for your categorical variables:

In [None]:
dummiesMarriage = pd.get_dummies(df["Marriage"], prefix = "Marriage", drop_first = True)
dummiesCard = pd.get_dummies(df["Card"], prefix = "Card", drop_first = True)
dummiesPay_0 = pd.get_dummies(df["Pay_0"], prefix = "Pay_0", drop_first = True)
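
To see what `drop_first = True` does, here is a minimal sketch on a toy column. The category labels below are hypothetical and are not taken from the credit card file:

```python
import pandas as pd

# Hypothetical toy column standing in for df["Marriage"]
marriage = pd.Series(["Single", "Married", "Other", "Married"], name="Marriage")

# Without drop_first: one indicator column per category
all_dummies = pd.get_dummies(marriage, prefix="Marriage")

# With drop_first: the first category in sorted order ("Married" here) is dropped
# and becomes the baseline that the remaining columns are compared against
dummies = pd.get_dummies(marriage, prefix="Marriage", drop_first=True)

print(list(all_dummies.columns))
print(list(dummies.columns))
```

Dropping one column loses no information: a row with zeros in every remaining column belongs to the baseline category.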

Now combine the numeric features and dummy variables:

In [None]:
features = pd.concat([numericFeatures, dummiesMarriage, dummiesPay_0, dummiesCard], axis = 1)

### Partition the Data

Let's partition the data into training and test data sets:

In [None]:
from sklearn.model_selection import train_test_split
featuresTrain, featuresTest, outcomeTrain, outcomeTest = train_test_split(features, 
                                                                          outcome, 
                                                                          test_size = 0.33, 
                                                                          random_state = 42)

### Scale the Features

Similar to the Support Vector Machine model, it's a good idea to scale your features with Neural Network models:

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
featuresTrain_norm = scaler.fit_transform(featuresTrain) #you fit to the training features
featuresTest_norm = scaler.transform(featuresTest)       #you only transform the test features
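
The fit/transform split above can be illustrated with a small sketch. The numbers and the names `toy_scaler`, `train`, and `test` are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-column data: the training values span 0 to 10
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[5.0], [20.0]])   # 20 is larger than anything seen in training

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns min = 0 and max = 10 from training data
test_scaled = toy_scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test data is scaled with the training minimum and maximum, a test value outside the training range lands outside the 0 to 1 interval. That is expected, and it keeps information about the test set from leaking into the model.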

## Building a Neural Network Model

Remember the four-step process we used to build our models last class:
1. Set up the model
2. Fit the training data to the model
3. Predict outcomes using the model
4. Assess the fit of the model

Refer to the code from last class to build a neural network model. Here are several pieces of information you'll need:

* The package is `sklearn.neural_network`
* The function we'll be using is `MLPClassifier()`
    - Set the `random_state` to 42
    - You'll also need to specify `hidden_layer_sizes = (30,30)` inside the `MLPClassifier()` function
* Get the classification report for both the training and test predictions
* Get the confusion matrix for the test predictions

# Natural Language Processing

Natural language processing (NLP) models are designed to process human language for a variety of tasks, such as analyzing sentiment, summarizing text, translating text, and, as we've seen with ChatGPT, answering questions.

ChatGPT is a large language model built on a deep learning neural network architecture called the Generative Pre-trained Transformer (GPT). GPT-3, a predecessor of the models behind ChatGPT, has 175 billion parameters (the weights learned during training). The transformer architecture lets the model weigh the importance of different words relative to each other.

## Sentiment Analysis

To demonstrate how NLP models work (at a level that our laptops can handle), let's build a sentiment analysis model that can classify whether a film review is positive or negative.

First, let's import our dataset of 8,957 reviews from IMDb.com:

In [None]:
df2 = pd.read_excel("IMDB_reduced.xlsx")
df2

For this model, our `outcome` variable is `label` (0 = negative review, 1 = positive review) and our `feature` is the review `text`. In the next code cell, create an `outcomeIMDB` and `featureIMDB` variable:

Now partition the data into training and test data sets, using `random_state = 42` and a `test_size` of 33%:
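
A minimal sketch of these two steps, run on a hypothetical six-row stand-in for `df2` that only shares its column names (`text` and `label`). The `toy...` names are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature stand-in for df2, with the same two column names
toy = pd.DataFrame({
    "text": ["loved it", "terrible plot", "great cast", "boring", "a delight", "awful"],
    "label": [1, 0, 1, 0, 1, 0],
})

# Select single columns so that both variables are one-dimensional Series
toyOutcome = toy["label"]
toyFeature = toy["text"]

# Same settings as requested above: 33% test data, random_state = 42
toyFeatureTrain, toyFeatureTest, toyOutcomeTrain, toyOutcomeTest = train_test_split(
    toyFeature, toyOutcome, test_size=0.33, random_state=42
)

print(len(toyFeatureTrain), len(toyFeatureTest))
```

Selecting the text with single brackets matters here: the vectorizer used later expects a one-dimensional collection of strings, not a one-column DataFrame.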

### Vectorizing the Text

Now, we need to convert our review text into numeric data. To do this, we'll use the `TfidfVectorizer()`. Similar to the scaler we used above, we first need to initialize the vectorizer:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize

vectorizer = TfidfVectorizer(max_features = 500, tokenizer = casual_tokenize)

The TF-IDF score (Term Frequency times Inverse Document Frequency) weights each word in a review by how often it appears in that review (TF) and by how rare it is across all reviews in the training data set (IDF). A higher score means the word is more distinctive for that review.

The tokenizer we are using (`casual_tokenize`) splits each review into individual tokens and can handle punctuation and special characters such as emoticons. Setting `max_features = 500` limits the vocabulary to the 500 tokens that occur most often across the training reviews; all other tokens are ignored.
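
To make the weighting concrete, here is a hand-rolled sketch in plain Python. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The toy reviews and the `tfidf` helper are hypothetical:

```python
import math
from collections import Counter

# Hypothetical toy reviews, already split into tokens
docs = [
    ["great", "film", "great", "cast"],
    ["boring", "film"],
    ["great", "story"],
]
n_docs = len(docs)

# Document frequency: the number of reviews that contain each token
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf(doc):
    """Return a {token: weight} dict for one tokenized review."""
    counts = Counter(doc)  # term frequency within this review
    # Smoothed IDF (assumed scikit-learn default): ln((1 + n) / (1 + df)) + 1
    raw = {
        token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
        for token, count in counts.items()
    }
    # L2-normalize so the weights of each review have unit length
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {token: weight / norm for token, weight in raw.items()}

weights = tfidf(docs[0])
for token, weight in sorted(weights.items()):
    print(token, round(weight, 3))
```

In the first review, "cast" and "film" each appear once, but "cast" receives the larger weight because it occurs in fewer reviews. "great" scores highest because it appears twice.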

Next, we transform our training and test features using the vectorizer, again similar to the scaler we used above:

In [None]:
featureIMDBTrain_v = vectorizer.fit_transform(featureIMDBTrain)
featureIMDBTest_v = vectorizer.transform(featureIMDBTest)

### Building the Model

Now we can build a neural network model using these vectorized features. Copy/paste your neural network code from above and replace the features and outcome variables as needed. Also use the following settings:

* Set the `random_state` to 42
* Use (10,10,10) for the `hidden_layer_sizes`
* Name your model `modReview`

Be sure to print out the classification reports and the confusion matrix for the test data.
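
For orientation, the sketch below runs the same kind of pipeline end to end on a handful of made-up reviews. It is an illustration under stated assumptions, not the solution for the IMDb data: it uses scikit-learn's default tokenizer rather than `casual_tokenize`, the names `toyVectorizer` and `modToy` are hypothetical, and `max_iter = 2000` is added only to help convergence on such a tiny data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neural_network import MLPClassifier

# Hypothetical toy reviews (1 = positive, 0 = negative)
trainText = ["loved this film", "great acting and story", "wonderful and fun",
             "terrible and boring", "awful acting", "hated this film"]
trainLabel = [1, 1, 1, 0, 0, 0]
testText = ["great fun", "boring and awful"]
testLabel = [1, 0]

# Fit the vectorizer on the training text only, then transform both sets
toyVectorizer = TfidfVectorizer()
trainText_v = toyVectorizer.fit_transform(trainText)
testText_v = toyVectorizer.transform(testText)

# The sparse TF-IDF matrix can be passed to the model directly
modToy = MLPClassifier(hidden_layer_sizes=(10, 10, 10), random_state=42, max_iter=2000)
modToy.fit(trainText_v, trainLabel)

predTest = modToy.predict(testText_v)
print(classification_report(testLabel, predTest, zero_division=0))
print(confusion_matrix(testLabel, predTest, labels=[0, 1]))
```

With so few reviews the reported scores mean little; the point is only the order of the steps.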

### Using the Model

Now that we've built our model, we can use it to classify new film reviews:

In [None]:
# New film review
new_review = ["I really enjoyed the film. It kept me entertained the whole time."]

# Tokenize and vectorize the new review using the same vectorizer
new_review_v = vectorizer.transform(new_review)

# Predict sentiment on the new review
modReview.predict(new_review_v)          # predicted class: 0 (negative) or 1 (positive)
modReview.predict_proba(new_review_v)    # probability of each class, in the order [P(0), P(1)]