##### Download Training Data and Upload to Azure Data Lake Storage Container

Copy the training data from 
https://drive.google.com/file/d/1uNsbvMDz7Zz5cyskjNe0HL5LLXlpVvdX/view?usp=share_link to your ADLS container. 

This data is used for model training. 

In this notebook, we apply NLP techniques to train a machine learning model that uses the `Reviews` data to predict star ratings.

What we are going to do:
- Step 1: Prepare the training data for machine learning.
- Step 2: Train the machine learning model.
- Step 3: Save the model to an Azure storage folder so that you can use it for future predictions.

#### Initialise SparkSession

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ModelTraining').getOrCreate()

#### Set Path to Training Data and Read to Dataframe

In [2]:
# Set the file path to the location of the downloaded training data in Azure Data Lake
file_location = 'mnt/bd-Project/yelp-training-data/*'


reviews = spark.read\
  .parquet(file_location)

#### Print Training Data Schema

In [3]:
reviews.printSchema()

root
 |-- business_id: string (nullable = true)
 |-- cool: long (nullable = true)
 |-- date: string (nullable = true)
 |-- funny: long (nullable = true)
 |-- review_id: string (nullable = true)
 |-- stars: double (nullable = true)
 |-- text: string (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)



In [4]:
reviews.show(5)

+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|         business_id|cool|               date|funny|           review_id|stars|                text|useful|             user_id|
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|ORL4JE6tz3rJxVqkd...|   0|2015-03-22 19:01:49|    0|RdDRv8WuATj_19ltu...|  4.0|I remember stayin...|     0|FPOLMElOP7Xpqlwgo...|
|RCy4M2ND4YK0uRbod...|   0|2015-04-02 16:52:19|    0|EMJDxWocRuQ-6HsVT...|  4.0|Convenient locati...|     2|vScaSrM91Z43ypSR9...|
|VN2CJfXX6ooJt-Nc3...|   1|2014-11-18 15:31:43|    0|zLD4GdfIjaXZF-cUH...|  5.0|It's huge- you do...|     1|RPrbFB_bcot5TdNvj...|
|GInRkBWvuyJCjFVHY...|   0|2018-03-24 00:18:33|    0|L4DxZ-PGArRpOdEGS...|  1.0|I'm not sure what...|     0|XVCAuOwGZHwtytJal...|
|M4kHDHNzftSUtgpgy...|   0|2017-12-07 19:35:05|    0|OBpNmfGl2ysBpLJ1_...|  5.0|Finally, f

In [5]:
reviews = reviews.select("text", "stars")
reviews.show(5)

+--------------------+-----+
|                text|stars|
+--------------------+-----+
|I remember stayin...|  4.0|
|Convenient locati...|  4.0|
|It's huge- you do...|  5.0|
|I'm not sure what...|  1.0|
|Finally, finally ...|  5.0|
+--------------------+-----+
only showing top 5 rows



In [6]:
# Cache the DataFrame for repeated use; cache() is lazy, so count() triggers the actual caching
reviews.cache()
reviews.count()

698757

#### Prepare Data for Usage

In [7]:
from pyspark.sql.functions import lower, regexp_replace


def prepare_data(df):
    # Select only the "text" and "stars" columns from the DataFrame
    df = df.select("text", "stars")
    
    cleaned = (
        # Convert to lowercase
        df.withColumn("text", lower(df.text))

        # Remove every character that is not a letter or whitespace
        .withColumn("text", regexp_replace("text", "[^a-zA-Z\\s]", ""))

        # Collapse each run of whitespace into a single space
        .withColumn("text", regexp_replace("text", "\\s+", " "))
    )
    
    # Return the cleaned DataFrame with processed text
    return cleaned
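
To see what the three cleaning steps do to a single review, the sketch below mirrors them in plain Python with the `re` module. The helper name `clean_text` is hypothetical and is not part of the notebook. The `re.ASCII` flag is an assumption made so that `\s` stays close to the ASCII-only behavior of the Java regex engine that Spark's `regexp_replace` uses. Treat this as an illustration of the logic, not as a replacement for `prepare_data`.

```python
import re

# Hypothetical single-string mirror of prepare_data (not part of the notebook).
# re.ASCII limits \s to ASCII whitespace; this is an assumption made to
# approximate the Java regex semantics behind Spark's regexp_replace.
_NON_LETTERS = re.compile(r"[^a-zA-Z\s]", flags=re.ASCII)
_WHITESPACE_RUN = re.compile(r"\s+", flags=re.ASCII)


def clean_text(text: str) -> str:
    # Step 1: convert to lowercase
    lowered = text.lower()
    # Step 2: drop everything that is not a letter or whitespace
    letters_only = _NON_LETTERS.sub("", lowered)
    # Step 3: collapse each run of whitespace into a single space
    return _WHITESPACE_RUN.sub(" ", letters_only)


print(clean_text("It's huge- you don't need 2 maps!!"))
# prints: its huge you dont need maps
```

Note that neither this sketch nor `prepare_data` trims leading or trailing whitespace, so a review that starts or ends with a space keeps one space there.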

#### Create and Train a Pipeline for Text Classification Using Logistic Regression

In [8]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.feature import IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

def create_and_train_pipeline(cleaned):
    # Split the cleaned DataFrame into training and test sets. randomSplit normalizes
    # weights that do not sum to 1, so [0.8, 0.1] gives roughly 89% train / 11% test.
    train, test = cleaned.randomSplit([0.8, 0.1], seed=2024)
    
    # Initialize a tokenizer to convert the "text" column into individual tokens
    tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
    
    # Initialize a stopword remover to filter out common stopwords from the tokens
    stopword_remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
    
    # Create a count vectorizer to convert the filtered tokens into a feature vector
    cv = CountVectorizer(vocabSize=2**16, inputCol="filtered", outputCol='cv')
    
    # Initialize an IDF transformer to adjust the feature vector based on document frequency
    idf = IDF(inputCol='cv', outputCol="features", minDocFreq=5)
    
    # Create a label encoder to convert the "stars" column into a numerical label
    label_encoder = StringIndexer(inputCol="stars", outputCol="label")
    
    # Initialize a logistic regression model: at most 100 iterations, L2 regularization (regParam=0.1, elasticNetParam=0.0)
    lr = LogisticRegression(maxIter=100, regParam=0.1, elasticNetParam=0.0)
    
    # Create a pipeline that chains together all the stages defined above
    pipeline = Pipeline(stages=[tokenizer, stopword_remover, cv, idf, label_encoder, lr])
    
    # Fit the pipeline model to the training data
    pipeline_model = pipeline.fit(train)
    
    # Transform the test data using the fitted pipeline model to generate predictions
    predictions = pipeline_model.transform(test)
    
    # Return the predictions and the fitted pipeline model
    return predictions, pipeline_model
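
The `CountVectorizer` and `IDF` stages together turn each review into TF-IDF features. The plain-Python sketch below illustrates the weighting on a toy corpus, using the smoothed formula documented for Spark ML, `idf = log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term. The function `tfidf` and the toy documents are hypothetical. The sketch ignores the `vocabSize` cap and returns dictionaries rather than sparse vectors, so it shows the arithmetic only.

```python
import math
from collections import Counter


def tfidf(docs, min_doc_freq=0):
    """Return one {term: weight} dict per tokenized document (illustrative only)."""
    m = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF; terms seen in fewer than min_doc_freq documents get weight 0
    idf = {
        term: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }
    # Term count (what CountVectorizer produces) multiplied by IDF
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Toy corpus, assumed for illustration
docs = [["good", "food"], ["bad", "food"], ["good", "good", "service"]]
weights = tfidf(docs, min_doc_freq=2)
print(weights[2])
```

With `min_doc_freq=2`, the terms `bad` and `service` appear in only one document each and are zeroed out, which is the same effect `minDocFreq=5` has on rare terms in the pipeline.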


#### Evaluate the Model's Accuracy on the Test Set

In [9]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


# Prepare the data by cleaning it and selecting relevant columns
cleaned_data = prepare_data(reviews)

# Create and train the machine learning pipeline, obtaining predictions and the fitted model
predictions, pipeline_model = create_and_train_pipeline(cleaned_data)

# Initialize an evaluator to assess the model's performance using accuracy as the metric
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

# Evaluate the predictions made by the model and compute the accuracy
accuracy = evaluator.evaluate(predictions)

# Print the accuracy of the model on the test set
print(f"Test set accuracy = {accuracy}")


Test set accuracy = 0.6579255353353483
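
The accuracy metric is the fraction of test rows whose predicted label index equals the true label index. The sketch below computes it for a handful of made-up `(label, prediction)` pairs. The function name `accuracy_from_pairs` and the sample values are assumptions for illustration and are unrelated to the result reported above.

```python
def accuracy_from_pairs(pairs):
    """Fraction of (label, prediction) pairs that match (illustrative only)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (label, prediction) pair")
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


# Made-up label/prediction indices, not taken from the notebook's output
sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 2.0), (0.0, 0.0)]
print(accuracy_from_pairs(sample))
# prints: 0.75
```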



#### Save the Model File to Azure Storage

In [10]:
# Save the pipeline model to the mnt/bd-Project/LRModel directory
pipeline_model.save('mnt/bd-Project/LRModel')

# Access the fitted label encoder model from the pipeline model since it is the second to last stage
le_model = pipeline_model.stages[-2]

# Save the label encoder model to the mnt/bd-Project/StringIndexer directory
le_model.save('mnt/bd-Project/StringIndexer')

print('models successfully saved to specified locations')

models successfully saved to specified locations
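
The label encoder is presumably saved separately because the model predicts label indices, not star ratings. By default, `StringIndexer` assigns index 0 to the most frequent value, so an index does not equal a star rating. The plain-Python sketch below imitates that ordering on made-up data and maps a predicted index back to a rating. The names `fit_label_order`, `stars`, and `predicted_index` are hypothetical, and the alphabetical tie-break is an assumption. In Spark, the fitted model's `labels` attribute plays the role of the `labels` list here.

```python
from collections import Counter


def fit_label_order(values):
    """Order distinct values by descending frequency, like StringIndexer's default."""
    counts = Counter(str(v) for v in values)
    # Ties are broken alphabetically here; treat that as an assumption
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


# Made-up ratings: 5.0 is the most frequent, so it gets index 0
stars = [5.0, 5.0, 5.0, 4.0, 4.0, 1.0]
labels = fit_label_order(stars)

# Map a predicted label index back to the original star rating
predicted_index = 1.0
predicted_stars = labels[int(predicted_index)]
print(labels, predicted_stars)
# prints: ['5.0', '4.0', '1.0'] 4.0
```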
