## ChatGPT Transcripts Processing
This notebook processes a collection of `.txt` files located in the specified directory (`/Users/mts517/Desktop/NLP Analysis/TXT/`). Each `.txt` file follows the naming format `NetID_Ex#.txt`, where `NetID` is a unique identifier, and `Ex#` denotes an exercise number.

The script performs the following tasks for each `.txt` file:
1. Extracts the `NetID` and integer portion of the `Exercise` number from the filename.
2. Splits the file content into interactions, treating lines that contain only `You` or `ChatGPT` as delimiters.
3. Captures the student's prompt following `You` and ChatGPT's response following `ChatGPT`. An interaction is recorded only when both a prompt and a response are present.
4. Writes each interaction to a CSV file (`ChatGPT_Scripts.csv`) located at `/Users/mts517/Desktop/NLP Analysis/`, with the following columns:
   - **NetID**: Identifier from the filename.
   - **Exercise**: The integer exercise number.
   - **Student_prompt**: Text from the student's input.
   - **ChatGPT_answer**: Text from ChatGPT's response.
   - **Interaction_Sequence**: The sequence number of the interaction within the file.

The notebook continues this process until all `.txt` files in the directory are processed, resulting in a structured CSV file with each interaction organized sequentially.
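
The filename convention from step 1 can be checked up front with a single regular expression. The following is a minimal sketch rather than the notebook's own code: the helper name `parse_filename`, the sample filename `abc123_Ex4.txt`, and the exact pattern (a NetID without underscores, then `Ex` followed by digits) are assumptions based on the `NetID_Ex#.txt` format described above.

```python
import re

# Assumed pattern for "NetID_Ex#.txt"; adjust it if real filenames differ.
FILENAME_RE = re.compile(r"^(?P<netid>[^_]+)_Ex(?P<exercise>\d+)\.txt$")


def parse_filename(filename):
    """Return (netid, exercise) for a matching filename, or None otherwise."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return match.group("netid"), int(match.group("exercise"))


print(parse_filename("abc123_Ex4.txt"))  # ('abc123', 4)
print(parse_filename("notes.txt"))       # None
```

Returning `None` for non-matching names lets the caller skip or log stray files instead of failing partway through the directory.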


In [7]:
import os
import csv
import re

# Define the directory path and output CSV file path
directory_path = "/Users/mts517/Desktop/NLP Analysis/TXT/"
output_csv_path = "/Users/mts517/Desktop/NLP Analysis/ChatGPT_Scripts.csv"

# Open the CSV file for writing
with open(output_csv_path, mode='w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    # Write the header
    writer.writerow(["NetID", "Exercise", "Student_prompt", "ChatGPT_answer", "Interaction_Sequence"])
    
    # Iterate over each .txt file in the specified directory
    # (sorted, because os.listdir returns names in arbitrary order)
    for filename in sorted(os.listdir(directory_path)):
        if filename.endswith(".txt"):
            # Extract NetID and Exercise from the filename (format: NetID_Ex#.txt)
            # partition() avoids an IndexError when a filename has no underscore
            netid, _, exercise_part = filename.partition("_")
            exercise_match = re.search(r'\d+', exercise_part)  # Find integers in the exercise part
            exercise = int(exercise_match.group()) if exercise_match else None
            
            # Read the contents of the .txt file
            with open(os.path.join(directory_path, filename), 'r', encoding='utf-8') as file:
                lines = file.readlines()
                
                # Initialize variables to store prompts and answers
                student_prompt, chatgpt_answer = "", ""
                interaction_sequence = 1
                in_student_prompt = False
                in_chatgpt_answer = False
                
                # Process each line to capture interactions
                for line in lines:
                    line = line.strip()  # Remove leading and trailing whitespace (including the newline)
                    
                    # Check for "You" or "ChatGPT" delimiters
                    if line == "You":
                        # Save previous interaction if any
                        if student_prompt and chatgpt_answer:
                            writer.writerow([netid, exercise, student_prompt.strip(), chatgpt_answer.strip(), interaction_sequence])
                            interaction_sequence += 1
                            student_prompt, chatgpt_answer = "", ""
                        # Start capturing new Student prompt
                        in_student_prompt = True
                        in_chatgpt_answer = False
                    elif line == "ChatGPT":
                        # Start capturing ChatGPT answer
                        in_student_prompt = False
                        in_chatgpt_answer = True
                    else:
                        # Append lines to respective prompts or answers
                        if in_student_prompt:
                            student_prompt += " " + line
                        elif in_chatgpt_answer:
                            chatgpt_answer += " " + line
                
                # Write any last interaction to the CSV (if exists)
                if student_prompt and chatgpt_answer:
                    writer.writerow([netid, exercise, student_prompt.strip(), chatgpt_answer.strip(), interaction_sequence])
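
The line-by-line capture logic above is easier to test when it is separated from file and CSV handling. The sketch below is one possible refactoring, not the notebook's own code: the function name `parse_transcript` and the sample transcript are hypothetical. It also skips blank lines, whereas the loop above appends them as extra spaces.

```python
def parse_transcript(lines):
    """Split transcript lines into (prompt, answer) pairs.

    A line equal to "You" starts a student prompt; a line equal to
    "ChatGPT" starts the answer to that prompt.
    """
    interactions = []
    prompt_parts, answer_parts = [], []
    current = None  # the list currently being filled, if any

    for raw_line in lines:
        line = raw_line.strip()
        if line == "You":
            # Save the previous interaction only if it is complete
            if prompt_parts and answer_parts:
                interactions.append((" ".join(prompt_parts), " ".join(answer_parts)))
                prompt_parts, answer_parts = [], []
            current = prompt_parts
        elif line == "ChatGPT":
            current = answer_parts
        elif current is not None and line:
            current.append(line)

    # Save the last interaction, if complete
    if prompt_parts and answer_parts:
        interactions.append((" ".join(prompt_parts), " ".join(answer_parts)))
    return interactions


# Made-up transcript for illustration
sample = [
    "You\n", "What is a loop?\n",
    "ChatGPT\n", "A loop repeats code.\n", "It runs until a condition fails.\n",
    "You\n", "Thanks\n",
    "ChatGPT\n", "You're welcome.\n",
]
for sequence, (prompt, answer) in enumerate(parse_transcript(sample), start=1):
    print(sequence, prompt, "->", answer)
```

Enumerating the returned pairs from 1 reproduces the `Interaction_Sequence` numbering used in the CSV.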


# Joining the two datasets

This script performs a left join of two CSV datasets on the `NetID` and `Exercise` columns, merging `ChatGPT_Scripts.csv` with `Final_Eye_Tracking_Profiles_Merged.csv`. Both files are located in `/Users/mts517/Desktop/NLP Analysis/`.

1. **Paths**: The paths are defined for input files (`ChatGPT_Scripts.csv` and `Final_Eye_Tracking_Profiles_Merged.csv`) and the output file (`Joined_Dataset.csv`).
2. **Loading Datasets**: The script reads both CSV files into pandas DataFrames.
3. **Left Join Operation**: A left join is performed on the `NetID` and `Exercise` columns. The merged DataFrame retains all rows from `ChatGPT_Scripts.csv` and adds the columns from `Final_Eye_Tracking_Profiles_Merged.csv` wherever the `NetID` and `Exercise` values match. Rows without a match get `NaN` in those added columns.
4. **Column Reordering**: After the merge, the columns `Student_prompt`, `ChatGPT_answer`, and `Interaction_Sequence` are moved to the end of the DataFrame.
5. **Saving the Output**: The final DataFrame is saved as `Joined_Dataset.csv` in the specified directory.

In the merged dataset, each row holds the eye-tracking profile for a `NetID`/`Exercise` pair followed by one prompt-answer interaction, with the interaction columns placed last.
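
A left join silently keeps unmatched rows and can multiply rows if the right-hand keys are duplicated, so it can help to check both. The sketch below uses small made-up DataFrames in place of the two CSV files; the column `Fixation_Count` and all values are hypothetical. It relies on the `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; all values are made up.
chatgpt_df = pd.DataFrame({
    "NetID": ["aaa1", "aaa1", "bbb2"],
    "Exercise": [1, 1, 2],
    "Student_prompt": ["p1", "p2", "p3"],
    "ChatGPT_answer": ["a1", "a2", "a3"],
    "Interaction_Sequence": [1, 2, 1],
})
eye_tracking_df = pd.DataFrame({
    "NetID": ["aaa1"],
    "Exercise": [1],
    "Fixation_Count": [42],  # hypothetical eye-tracking column
})

merged_df = pd.merge(
    chatgpt_df,
    eye_tracking_df,
    on=["NetID", "Exercise"],
    how="left",
    validate="many_to_one",  # raises MergeError if eye-tracking keys repeat
    indicator=True,          # adds a "_merge" column describing each row
)

# Rows from the ChatGPT data that found no eye-tracking profile
unmatched = merged_df[merged_df["_merge"] == "left_only"]
print(len(merged_df), len(unmatched))  # 3 1
```

The `_merge` column would normally be dropped before saving. It is kept here only to make the unmatched row visible.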


In [12]:
import pandas as pd

# Define file paths
chatgpt_scripts_path = "/Users/mts517/Desktop/NLP Analysis/ChatGPT_Scripts.csv"
eye_tracking_profiles_path = "/Users/mts517/Desktop/NLP Analysis/Final_Eye_Tracking_Profiles_Merged.csv"
output_path = "/Users/mts517/Desktop/NLP Analysis/Joined_Dataset.csv"

# Load the datasets
chatgpt_df = pd.read_csv(chatgpt_scripts_path)
eye_tracking_df = pd.read_csv(eye_tracking_profiles_path)

# Perform a left join on 'NetID' and 'Exercise' columns
merged_df = pd.merge(chatgpt_df, eye_tracking_df, on=['NetID', 'Exercise'], how='left')

# Move 'Student_prompt', 'ChatGPT_answer', and 'Interaction_Sequence' to the end of the DataFrame
columns_order = [col for col in merged_df.columns if col not in ['Student_prompt', 'ChatGPT_answer', 'Interaction_Sequence']]
columns_order += ['Student_prompt', 'ChatGPT_answer', 'Interaction_Sequence']
merged_df = merged_df[columns_order]

# Save the merged dataset to a new CSV file
merged_df.to_csv(output_path, index=False)


# NetID Anonymization and Handling of Missing Values

This script anonymizes the `NetID` column in the joined dataset by replacing each unique `NetID` with a unique integer identifier, so that the original NetIDs no longer appear in that column. Missing values in the dataset are represented as `NaN`.

1. **File Paths**: Specifies the file path for the input dataset (`Joined_Dataset.csv`) and the output path for the anonymized dataset (`Anonymized_Joined_Dataset.csv`).
2. **Load Dataset**: Reads the joined dataset into a pandas DataFrame.
3. **Generate Anonymized Mapping**: Creates a mapping dictionary that assigns a unique integer to each distinct `NetID`, starting from 1.
4. **Anonymize `NetID`**: Replaces the original `NetID` values in the DataFrame with their respective integer identifiers from the mapping dictionary.
5. **Handle Missing Values**: Missing values across the DataFrame are represented as `NaN`, the marker pandas assigns to empty fields when it loads the CSV.
6. **Save Output**: Exports the anonymized and cleaned dataset as `Anonymized_Joined_Dataset.csv`.

The result is a dataset in which `NetID` values are replaced by integers and missing entries are marked as `NaN`.
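
Two details of these steps are easy to miss: how the integer IDs are assigned, and how missing values appear in the saved file. The sketch below illustrates both on a made-up DataFrame (the `Score` column and all values are hypothetical). It uses `pd.factorize` as an alternative to a hand-built mapping dictionary. Like the dictionary approach, it numbers NetIDs in order of first appearance.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "NetID": ["bbb2", "aaa1", "bbb2"],  # made-up identifiers
    "Score": [1.5, None, 2.0],          # hypothetical column with a gap
})

# factorize returns 0-based codes in order of first appearance; add 1 to start at 1
codes, uniques = pd.factorize(df["NetID"])
df["NetID"] = codes + 1

# By default, to_csv writes missing values as empty fields
default_csv = df.to_csv(index=False)
# na_rep writes them as explicit text instead
explicit_csv = df.to_csv(index=False, na_rep="NaN")

print(default_csv.splitlines())
print(explicit_csv.splitlines())

# Either form is read back as NaN by read_csv
reloaded = pd.read_csv(io.StringIO(default_csv))
print(int(reloaded["Score"].isna().sum()))
```

Note that `pd.factorize` assigns the code `-1` to a missing `NetID`, which becomes `0` after adding 1.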


In [14]:
import pandas as pd

# Define file paths for the joined dataset and the anonymized output
joined_dataset_path = "/Users/mts517/Desktop/NLP Analysis/Joined_Dataset.csv"
anonymized_output_path = "/Users/mts517/Desktop/NLP Analysis/Anonymized_Joined_Dataset.csv"

# Load the joined dataset
df = pd.read_csv(joined_dataset_path)

# Create a unique integer mapping for each unique NetID (numbered from 1 in order of first appearance)
netid_mapping = {netid: i for i, netid in enumerate(df['NetID'].unique(), start=1)}

# Replace NetID values with the corresponding integer
df['NetID'] = df['NetID'].map(netid_mapping)

# Missing values are already NaN after read_csv, so fillna(np.nan) would change nothing.
# na_rep makes to_csv write them as the literal text "NaN" instead of empty fields.
df.to_csv(anonymized_output_path, index=False, na_rep="NaN")