# [TEST] - sa_lr_pyspark_main

## Execution: Main Script

### Import the required libraries

In [None]:
from pyspark.sql import SparkSession
from preprocessing.sa_lr_pyspark_preprocessing import preprocess_data
from training.sa_lr_pyspark_training import train_model

### Script Test

Initializes a Spark session named "SocialApp" or retrieves an existing one if available.

In [None]:
spark = SparkSession.builder.appName("SocialApp").getOrCreate()

URL of the Google Sheets document that contains the data.

In [None]:
filename = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQrsbFzUZtypCv80I7lGN4qs1m56Qss5X54FzTH-gb0lx569sjkRKCtSRemMhF1tca38rVu-mQFhbez/pubhtml?gid=817597830&single=true'
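The link above ends in `pubhtml`, which is the HTML view of a sheet published to the web. How `preprocess_data` reads that link is not shown here. If it expects CSV rather than HTML, the following is a minimal sketch of deriving a CSV link from the published one. It assumes the usual "publish to web" layout, where the path ends in `pub` and the query carries `output=csv`. The helper name `to_csv_export_url` is hypothetical and is not part of the project.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def to_csv_export_url(published_url: str) -> str:
    """Turn a 'publish to web' pubhtml link into its CSV variant (assumed URL layout)."""
    parts = urlsplit(published_url)

    # Swap the trailing 'pubhtml' path segment for 'pub'.
    path = parts.path
    if path.endswith("/pubhtml"):
        path = path[: -len("pubhtml")] + "pub"

    # Keep the existing query parameters (gid, single) and request CSV output.
    query = dict(parse_qsl(parts.query))
    query["output"] = "csv"

    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), parts.fragment))
```

Calling `to_csv_export_url(filename)` would keep the `gid` and `single` parameters and only change the path ending and the output format.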

The preprocess_data function is called with the Spark session and the specified URL; it processes the data and returns a data set.

In [None]:
dataset = preprocess_data(spark, filename)

The train_model function is called on the data set and returns a logistic regression model (lrModel) and a test data set (testData).

In [None]:
lrModel, testData = train_model(dataset)

The logistic regression model is used to make predictions on the test data set, resulting in a new DataFrame called predictions.

In [None]:
predictions = lrModel.transform(testData)

Filters the predictions DataFrame to the rows whose predicted class is 0, selects the columns 'clean_text', 'category', 'probability', 'label', and 'prediction', and sorts the result by the 'probability' column in descending order. Finally, the top 10 rows are displayed, with each cell truncated to 30 characters.

In [None]:
predictions.filter(predictions['prediction'] == 0).select("clean_text", "category", "probability", "label", "prediction")\
    .orderBy("probability", ascending=False).show(n=10, truncate=30)
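To make the filter, sort, and truncate steps concrete, here is a plain-Python sketch of the same logic on a list of dictionaries. It is an illustration only and does not use Spark. The function name `top_rows` is hypothetical, and the sketch assumes that probability vectors are ordered element by element, the way Python lists compare.

```python
def top_rows(rows, predicted_class=0.0, n=10, truncate=30):
    """Mimic filter -> orderBy(descending) -> show(n, truncate) on a list of dicts."""
    # Keep only the rows assigned to the requested class.
    kept = [r for r in rows if r["prediction"] == predicted_class]

    # Lists compare element by element, so the first probability dominates the order.
    kept.sort(key=lambda r: r["probability"], reverse=True)

    def clip(value):
        text = str(value)
        # Cells longer than `truncate` are cut and end in "..." (assumes truncate >= 4).
        return text if len(text) <= truncate else text[: truncate - 3] + "..."

    columns = ["clean_text", "category", "probability", "label", "prediction"]
    return [[clip(r[c]) for c in columns] for r in kept[:n]]
```

Because only rows with `prediction == 0` remain, sorting in descending order puts the rows the model is most confident belong to class 0 at the top.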

Selects the same columns, converts the result to a pandas DataFrame, and saves it to a CSV file called 'all_predicciones.csv' without the index column.

In [None]:
predictions.select("clean_text", "category", "probability", "label", "prediction")\
    .toPandas().to_csv('all_predicciones.csv', index=False)
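When the exported file is read back later, the 'probability' column arrives as text rather than as a vector. The sketch below shows one way to parse it with the standard library. It assumes each vector was written as a bracketed, comma-separated list such as `[0.9,0.1]`; the function name `read_predictions` and the sample row are hypothetical.

```python
import csv
import io
import json


def read_predictions(csv_file):
    """Yield one dict per row, with 'probability' parsed into a list of floats."""
    for row in csv.DictReader(csv_file):
        # Assumes the vector was written as a bracketed list, e.g. "[0.9,0.1]".
        row["probability"] = [float(p) for p in json.loads(row["probability"])]
        row["label"] = float(row["label"])
        row["prediction"] = float(row["prediction"])
        yield row


# Hypothetical sample standing in for the exported file.
sample = io.StringIO(
    "clean_text,category,probability,label,prediction\n"
    'hello world,pos,"[0.9,0.1]",0.0,0.0\n'
)
rows = list(read_predictions(sample))
print(rows[0]["probability"])  # [0.9, 0.1]
```

For the real file, the same function could be fed `open('all_predicciones.csv', newline='')` instead of the in-memory sample.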

Stops the Spark session, releasing its associated resources.

In [None]:
spark.stop()
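In a notebook, this last cell only runs if every earlier cell succeeds. When the same steps are moved into a script, one option is to wrap them so the session is stopped even if a step raises. The following is a minimal sketch; the helper name `stopping` is hypothetical, and it works with any object that has a `stop()` method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield the session and always call its stop() method afterwards."""
    try:
        yield session
    finally:
        # Runs on normal exit and when the body raises.
        session.stop()
```

Under that assumption, the script body would become `with stopping(SparkSession.builder.appName("SocialApp").getOrCreate()) as spark:` followed by the preprocessing, training, and export steps.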