# Data Reduction of the Falcon-Generated Dataset

## Overview
This notebook post-processes the output generated by the Falcon model. The model was run in multiple batches, and this notebook consolidates those batches into a single, refined dataset. Using the helper modules in the 'scripts' folder, it merges the batch outputs into one CSV file, removes rows with None values in the 'score' field, and adds a relevance label derived from the score.

## Workflow

1. Merge data from multiple Falcon model batches.
2. Remove rows with None values in the 'score' field.
3. Adjust relevance metrics based on scores.
4. Save refined data as a CSV file for further analysis.

In [None]:
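import sys

import pandas as pd  # needed for pd.read_csv in the plotting cell

# make the project's helper modules importable
sys.path.append("scripts")
import processing_csv

sys.path.append("scripts/plotting")
import dataset_length_distribution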

In [None]:
# all output files to adjust and save as CSV
txt_files = [
    "/1/output.txt",
    "/2/output.txt",
    "/3/output.txt",
    "/4/output.txt",
    "/1/output-21123.txt",
    "/2/output-22247.txt",
    "/3/output-23370.txt",
    "/4/output-24488.txt",
    # Add more file paths as needed
]
folder_path = "/p/project/deepacf/maelstrom/haque1/AP2-Social-media-data-for-better-local-forecasts/batch/"
output_csv_file = "/p/project/deepacf/maelstrom/haque1/AP2-Social-media-data-for-better-local-forecasts/data/new_dataset_tweet_er5_01_2017.csv"  # Replace with the path to your CSV file
processing_csv.process_files_and_append_to_csv(folder_path, txt_files, output_csv_file)
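
The internals of `process_files_and_append_to_csv` are not shown in this notebook. As a rough illustration of the merging step only, the sketch below concatenates per-batch tables with pandas. The function name `merge_batches`, the `text` column, and the sample rows are hypothetical; only the `score` column name comes from this notebook.

```python
import pandas as pd


def merge_batches(batches):
    """Concatenate per-batch DataFrames into one table with a fresh index.

    Hypothetical helper for illustration; not the project's implementation.
    """
    return pd.concat(batches, ignore_index=True)


# Made-up sample batches (assumed columns: 'text' and 'score')
batch_1 = pd.DataFrame({"text": ["rain today", "sunny"], "score": [0.9, 0.2]})
batch_2 = pd.DataFrame({"text": ["storm warning"], "score": [0.7]})

merged = merge_batches([batch_1, batch_2])
print(merged)
```

Passing `ignore_index=True` renumbers the rows, so indices from different batches do not collide in the merged table.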

In [None]:
# read the input CSV and remove every row whose 'score' value is None
input_csv_file = "/p/project/deepacf/maelstrom/haque1/AP2-Social-media-data-for-better-local-forecasts/data/new_dataset_tweet_er5_01_2017.csv"
output_csv_file = "/p/project/deepacf/maelstrom/haque1/AP2-Social-media-data-for-better-local-forecasts/data/refined_score_dataset_tweet_er5_01_2017.csv"
processing_csv.fill_missing_scores(input_csv_file, output_csv_file)
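
For readers without access to `processing_csv`, here is a minimal sketch of how rows with a missing score could be dropped in pandas. It assumes a missing score may appear either as a real null or as the literal string "None"; the function name `drop_missing_scores` and the sample data are hypothetical, and the project's `fill_missing_scores` may behave differently.

```python
import pandas as pd


def drop_missing_scores(df, column="score"):
    """Return a copy of df without rows whose score is missing.

    Hypothetical helper: non-numeric entries such as the string "None"
    are coerced to NaN first, then all NaN rows are dropped.
    """
    scores = pd.to_numeric(df[column], errors="coerce")
    cleaned = df.assign(**{column: scores}).dropna(subset=[column])
    return cleaned.reset_index(drop=True)


# Made-up sample: one valid score, one "None" string, one real null
raw = pd.DataFrame({"text": ["a", "b", "c"], "score": ["0.8", "None", None]})
cleaned = drop_missing_scores(raw)
print(cleaned)
```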

In [None]:
# add a new column named 'relevance': 1 if 'score' > 0.5, else 0
input_csv_file = "/p/project/deepacf/maelstrom/haque1/AP2-Social-media-data-for-better-local-forecasts/data/refined_score_dataset_tweet_er5_01_2017.csv"
output_csv_file = "/p/project/deepacf/maelstrom/haque1/AP2-Social-media-data-for-better-local-forecasts/data/relivance_score_dataset_tweet_er5_01_2017.csv"
processing_csv.add_relevance_column(input_csv_file, output_csv_file)
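
The thresholding rule described in the comment above can be sketched in a few lines of pandas. This is an illustration under the stated rule (1 if `score` > 0.5, else 0), not the code of `add_relevance_column`; the function name `add_relevance` and the sample scores are hypothetical.

```python
import pandas as pd


def add_relevance(df, threshold=0.5):
    """Return a copy of df with a binary 'relevance' column.

    Hypothetical helper: relevance is 1 where score is strictly
    greater than the threshold, otherwise 0.
    """
    out = df.copy()
    out["relevance"] = (out["score"] > threshold).astype(int)
    return out


# Made-up scores: above, exactly at, and below the threshold
scored = pd.DataFrame({"score": [0.9, 0.5, 0.1]})
labelled = add_relevance(scored)
print(labelled)
```

Because the comparison is strict, a score of exactly 0.5 is labelled as not relevant.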

In [None]:
# read the CSV file and plot the distribution
csv_file = "/p/project/deepacf/maelstrom/haque1/AP2-Social-media-data-for-better-local-forecasts/data/relivance_score_dataset_tweet_er5_01_2017.csv"
df = pd.read_csv(csv_file)
dataset_length_distribution.plot_numeric_distribution_value(
    df, "relevance", bins=2, title="Distribution of relevant and non-relevant data", x_label="XM", y_label="Y"
)
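
As a numeric cross-check on the two-bin plot, the class counts can also be tabulated directly. The sketch below is an assumption-laden illustration: `relevance_counts` is a hypothetical helper and the sample labels are made up.

```python
import pandas as pd


def relevance_counts(df):
    """Count rows per relevance label, always reporting both 0 and 1.

    Hypothetical helper: reindexing guarantees a zero entry when one
    of the two classes is absent from the data.
    """
    return df["relevance"].value_counts().reindex([0, 1], fill_value=0)


# Made-up labels: three non-relevant rows and two relevant rows
sample = pd.DataFrame({"relevance": [1, 0, 0, 1, 0]})
counts = relevance_counts(sample)
print(counts)
```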