# Misconceptions in Mathematics

## Introduction
In the evolving landscape of education, the ability to accurately identify and address misconceptions in student understanding is paramount. As multiple-choice questions (MCQs) remain a staple assessment tool, the challenge of tagging distractors—incorrect answers crafted to reflect specific misconceptions—has become increasingly complex. This competition invites participants to develop a Natural Language Processing (NLP) model powered by Machine Learning (ML) to predict the affinity between misconceptions and distractors in MCQs.

The primary objective of this project is to create a model that not only aligns with established misconceptions but also adapts to new, emerging ones. By analyzing a dataset of diagnostic questions, where each distractor is designed to capture a particular misconception, we aim to streamline the tagging process for educators. This will not only enhance the consistency of tagging across various human labelers but also improve the educational experience for students by ensuring that misconceptions are properly addressed.

Given the intricacies of mathematical content and the limitations of initial attempts using pre-trained language models, our approach will focus on refining the tagging process to produce high-quality, actionable insights. Throughout this notebook, we will engage in exploratory data analysis (EDA), feature engineering, and the development of classification models. We will evaluate our models using the Mean Average Precision @ 25 (MAP@25) metric to ensure their effectiveness in predicting relevant misconceptions.

Ultimately, this project aims to contribute to the understanding and management of misconceptions in education, paving the way for more effective teaching strategies and improved student outcomes. Let’s commence by loading the necessary libraries and the dataset for our analysis.

## Table of Contents
1. [Package Installation](#package-installation)
2. [Library Imports](#library-imports)
3. [Data Loading](#data-loading)
4. [Initial Data Exploration](#initial-data-exploration)
5. [Data Preparation](#data-preparation)
6. [Modeling](#modeling)
7. [Model Evaluation](#model-evaluation)
8. [Conclusion and Next Steps](#conclusion-and-next-steps)

## Package Installation

In this cell, we will install the necessary Python packages for our data analysis project. This step ensures that all the libraries required for data manipulation, visualization, and natural language processing are available in our environment. We will use the `pip` command to install the following libraries:

- **NumPy**: A library for numerical calculations and array manipulation.
- **Pandas**: A powerful library for data manipulation and analysis, particularly useful for working with structured data.
- **Matplotlib**: A plotting library for creating static, animated, and interactive visualizations in Python.
- **Seaborn**: A statistical data visualization library based on Matplotlib that provides a high-level interface for creating attractive graphics.
- **Scikit-learn**: A machine learning library that provides simple and efficient tools for data mining and data analysis.
- **Torch**: A deep learning framework that provides a flexible and efficient platform for building neural networks.
- **Transformers**: A library from Hugging Face that provides pre-trained models and tools for natural language processing tasks.

In [16]:
# List of required libraries
required_libraries = [
    'numpy',         
    'pandas',        
    'matplotlib',    
    'seaborn',       
    'scikit-learn',  
    'torch',         
    'transformers'   
]

def install(package):
    """Install the package using pip in a Jupyter Notebook."""
    print(f"Installing {package}...")
    # Use the Jupyter magic command for installation
    get_ipython().system(f'pip install {package}')

# Map pip package names to import names where the two differ
IMPORT_NAMES = {'scikit-learn': 'sklearn'}

def check_libraries(libraries):
    """Check if the libraries are installed and install them if necessary."""
    missing_libraries = []

    for library in libraries:
        try:
            __import__(IMPORT_NAMES.get(library, library))
        except ImportError:
            missing_libraries.append(library)
        except Exception as e:
            # Catch other errors that may occur during the import
            print(f"Error importing {library}: {e}")
            missing_libraries.append(library)

    if missing_libraries:
        print(f"The following libraries are missing: {', '.join(missing_libraries)}")
        print("Starting installation...")

        installation_success = True  # Flag to track installation success

        for library in missing_libraries:
            try:
                install(library)
                print(f"{library} installed successfully.")
            except Exception as e:
                print(f"Failed to install {library}: {e}")
                installation_success = False  # Mark as failed if there was an error

        # Check again if the libraries were installed
        for library in missing_libraries:
            try:
                __import__(IMPORT_NAMES.get(library, library))
            except ImportError:
                print(f"Error: {library} was not installed correctly.")
                installation_success = False  # Mark as failed if still missing

        # Final message based on installation success
        if installation_success:
            print("All libraries were installed successfully.")
        else:
            print("Some libraries were not installed correctly.")
    else:
        print("All libraries are already installed.")

if __name__ == "__main__":
    check_libraries(required_libraries)

All libraries are already installed.


## Library Imports <a name="library-imports"></a>
In this cell, we will import all the necessary libraries that we will use throughout the analysis. This includes libraries for data manipulation, visualization, and machine learning.

In [1]:
# Standard Libraries
import warnings
from IPython.display import display

# Data Science Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning Libraries
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Deep Learning Libraries
import torch
from torch.utils.data import DataLoader

# Transformers Library
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments


## Data Loading <a name="data-loading"></a>
Here, we load the competition files: the training questions, the test questions, the misconception mapping, and the sample submission. We then examine the structure of the training data and check for initial issues such as missing values.

In [18]:
warnings.filterwarnings('ignore')

# Reading the CSV files
train_df = pd.read_csv('/kaggle/input/eedi-mining-misconceptions-in-mathematics/train.csv')
test_df = pd.read_csv('/kaggle/input/eedi-mining-misconceptions-in-mathematics/test.csv')
miss_df = pd.read_csv('/kaggle/input/eedi-mining-misconceptions-in-mathematics/misconception_mapping.csv')
sample_df = pd.read_csv('/kaggle/input/eedi-mining-misconceptions-in-mathematics/sample_submission.csv')

In [19]:
# Display the first few rows of dataset to verify the reading
print("\nMisconceptions in Mathematics Data:")
print(train_df.head())


Misconceptions in Mathematics Data:
   QuestionId  ConstructId                                      ConstructName  \
0           0          856  Use the order of operations to carry out calcu...   
1           1         1612  Simplify an algebraic fraction by factorising ...   
2           2         2774            Calculate the range from a list of data   
3           3         2377  Recall and use the intersecting diagonals prop...   
4           4         3387  Substitute positive integer values into formul...   

   SubjectId                                        SubjectName CorrectAnswer  \
0         33                                             BIDMAS             A   
1       1077                    Simplifying Algebraic Fractions             D   
2        339  Range and Interquartile Range from a List of Data             B   
3         88                       Properties of Quadrilaterals             C   
4         67                          Substitution into Formula        

In [20]:
# Use shape to check the dimensions of the training DataFrame
print("\nShape of Combined Data:", train_df.shape)


Shape of Combined Data: (1869, 15)


## Initial Data Exploration <a name="initial-data-exploration"></a>
In this section, we will conduct an exploratory data analysis (EDA) to gain a deeper understanding of our dataset. This analysis is crucial as it lays the foundation for the subsequent steps in our data cleaning and preparation process. We will examine various aspects of the data, including data types, summary statistics, and the presence of any missing values.

In [21]:
# Check for missing values
missing_values = train_df.isnull().sum()
print("Missing Values in Each Column:\n", missing_values)

Missing Values in Each Column:
 QuestionId            0
ConstructId           0
ConstructName         0
SubjectId             0
SubjectName           0
CorrectAnswer         0
QuestionText          0
AnswerAText           0
AnswerBText           0
AnswerCText           0
AnswerDText           0
MisconceptionAId    734
MisconceptionBId    751
MisconceptionCId    789
MisconceptionDId    832
dtype: int64
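
The missing counts above can be turned into a rough estimate of how many distractors actually carry a misconception label. The sketch below is plain arithmetic on the reported numbers; it assumes (this is not confirmed by the data shown here) that the correct answer of each question never has a misconception ID, so one empty slot per question is expected by design. The variable names are illustrative.

```python
# Counts copied from the outputs above.
n_questions = 1869
missing = {"A": 734, "B": 751, "C": 789, "D": 832}

total_slots = 4 * n_questions            # one misconception slot per answer option
total_missing = sum(missing.values())
labelled = total_slots - total_missing   # slots that hold a misconception ID

# Assumption: the correct answer never carries a misconception ID,
# so exactly one slot per question is empty by design.
n_distractors = 3 * n_questions
unlabelled_distractors = total_missing - n_questions

print(f"labelled distractors:   {labelled} of {n_distractors}")
print(f"unlabelled distractors: {unlabelled_distractors}")
print(f"labelled share:         {labelled / n_distractors:.1%}")
```

Under that assumption, roughly three quarters of the distractors are usable as labelled training examples; the rest are dropped later by `dropna`.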


In [22]:
# Check data types
data_types = train_df.dtypes
print("\nData Types of Each Column:\n", data_types)


Data Types of Each Column:
 QuestionId            int64
ConstructId           int64
ConstructName        object
SubjectId             int64
SubjectName          object
CorrectAnswer        object
QuestionText         object
AnswerAText          object
AnswerBText          object
AnswerCText          object
AnswerDText          object
MisconceptionAId    float64
MisconceptionBId    float64
MisconceptionCId    float64
MisconceptionDId    float64
dtype: object


In [23]:
# Correcting data types
# Identifier columns are stored as categoricals
for col in ['QuestionId', 'ConstructId', 'SubjectId']:
    train_df[col] = train_df[col].astype('category')

# Text columns are stored as strings
text_cols = ['CorrectAnswer', 'ConstructName', 'SubjectName', 'QuestionText',
             'AnswerAText', 'AnswerBText', 'AnswerCText', 'AnswerDText']
for col in text_cols:
    train_df[col] = train_df[col].astype(str)

# Misconception IDs can be NaN; the category dtype keeps missing values as NaN
for col in ['MisconceptionAId', 'MisconceptionBId', 'MisconceptionCId', 'MisconceptionDId']:
    train_df[col] = train_df[col].astype('category')

# Check the corrected data types
print("\nCorrected Data Types:")
print(train_df.dtypes)


Corrected Data Types:
QuestionId          category
ConstructId         category
ConstructName         object
SubjectId           category
SubjectName           object
CorrectAnswer         object
QuestionText          object
AnswerAText           object
AnswerBText           object
AnswerCText           object
AnswerDText           object
MisconceptionAId    category
MisconceptionBId    category
MisconceptionCId    category
MisconceptionDId    category
dtype: object
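
The misconception columns are cast to `category` here. If integer IDs with missing values are preferred, pandas also offers the nullable `Int64` dtype. The following is a small sketch on a toy column (the values are made up) that contrasts the two options; it is an alternative, not what the cell above does.

```python
import pandas as pd

# Toy column mimicking MisconceptionAId: float IDs with one missing value.
ids = pd.Series([1672.0, None, 143.0], name="MisconceptionAId")

as_category = ids.astype("category")    # the dtype used in the cell above
as_nullable_int = ids.astype("Int64")   # nullable integer dtype (capital "I")

print(as_category.dtype)
print(as_nullable_int.dtype)
print(as_nullable_int.isna().sum(), "missing value(s) preserved")
```

With `Int64` the IDs print as integers rather than floats, and the missing entry is kept as `<NA>` instead of forcing the column to `float64`.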


## Data Preparation <a name="data-preparation"></a>
In order to build effective machine learning models for our NLP competition, we must first ensure that our data is prepared thoroughly. The data preparation process is critical for achieving optimal model performance and involves several key steps.

First, we need to handle any missing values that may exist in our dataset. This is essential to prevent any disruptions in the training process and to ensure that our model can learn from complete data. Next, we will encode categorical variables, which allows our machine learning algorithms to interpret the data correctly and make accurate predictions.

Additionally, we will split the dataset into training and testing sets. This division is crucial as it enables us to evaluate our model's performance on unseen data, helping us avoid overfitting and ensuring that our model generalizes well.

We will also analyze the distribution of question lengths to understand the complexity of the data we are working with. Understanding this distribution can provide insights into how to preprocess our text data effectively. Cleaning the text data is another vital step, as it enhances the quality of the input provided to the model, ultimately leading to improved predictions.

Below is the code that implements these data preparation steps, creating prediction and test dataframes from our original dataset, and concatenating the relevant texts to form a complete input for our models.

In [24]:
def create_prediction_dataframe(df):
    # Create a list to store the predictions
    pred_list = []

    # Add the questions and answers
    for index, row in df.iterrows():
        for answer in ['A', 'B', 'C', 'D']:
            # Keep distractors only: skip the option that is the correct answer
            if answer != row['CorrectAnswer']:
                pred_list.append({
                    'QuestionId': row['QuestionId'],
                    'Answer': answer,
                    'QuestionText': row['QuestionText'],
                    'MisconceptionId': row[f'Misconception{answer}Id'],
                    'AnswerText': row[f'Answer{answer}Text'] 
                })

    # Create a dataframe from the list
    pred_data = pd.DataFrame(pred_list)
    return pred_data


def create_test_dataframe(df):
    # Create a list to store the test data
    test_list = []

    # Add the questions and answers
    for index, row in df.iterrows():
        for answer in ['A', 'B', 'C', 'D']:
            test_list.append({
                'QuestionId': row['QuestionId'],
                'Answer': answer,
                'QuestionText': row['QuestionText'],
                'AnswerText': row[f'Answer{answer}Text'] 
            })

    # Create a dataframe from the list
    test_data = pd.DataFrame(test_list)
    return test_data


# Create Prediction DataFrame for the training set
train_pred_data = create_prediction_dataframe(train_df)

# Join question and answer text with a space so the two do not run together
train_pred_data['QuestionAnswer'] = train_pred_data['QuestionText'] + ' ' + train_pred_data['AnswerText']
# Drop distractors that have no misconception label
train_pred_data.dropna(inplace=True)

test_pred_data = create_test_dataframe(test_df)

test_pred_data['QuestionAnswer'] = test_pred_data['QuestionText'] + ' ' + test_pred_data['AnswerText']

# Display the resulting DataFrame
display(train_pred_data)

Unnamed: 0,QuestionId,Answer,QuestionText,MisconceptionId,AnswerText,QuestionAnswer
2,0,D,\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...,1672.0,Does not need brackets,\[\n3 \times 2+4-5\n\]\nWhere do the brackets ...
3,1,A,"Simplify the following, if possible: \( \frac{...",2142.0,\( m+1 \),"Simplify the following, if possible: \( \frac{..."
4,1,B,"Simplify the following, if possible: \( \frac{...",143.0,\( m+2 \),"Simplify the following, if possible: \( \frac{..."
5,1,C,"Simplify the following, if possible: \( \frac{...",2142.0,\( m-1 \),"Simplify the following, if possible: \( \frac{..."
6,2,A,Tom and Katie are discussing the \( 5 \) plant...,1287.0,Only\nTom,Tom and Katie are discussing the \( 5 \) plant...
...,...,...,...,...,...,...
5602,1867,C,Tom and Katie are discussing congruence and si...,2312.0,Both Tom and Katie,Tom and Katie are discussing congruence and si...
5603,1867,D,Tom and Katie are discussing congruence and si...,2312.0,Neither is correct,Tom and Katie are discussing congruence and si...
5604,1868,A,Jo and Paul are arguing about how to fully des...,801.0,Only\nJo,Jo and Paul are arguing about how to fully des...
5605,1868,C,Jo and Paul are arguing about how to fully des...,801.0,Both Jo and Paul,Jo and Paul are arguing about how to fully des...
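
The row-by-row loop in `create_prediction_dataframe` can also be written column-wise, which tends to be faster on larger frames. The sketch below shows one possible way on a toy frame with the same column layout as `train_df`; the values and the name `long_df` are made up for illustration.

```python
import pandas as pd

# Toy frame with the same column layout as train_df (values are invented).
toy = pd.DataFrame({
    "QuestionId": [0, 1],
    "CorrectAnswer": ["A", "D"],
    "QuestionText": ["Q0?", "Q1?"],
    "AnswerAText": ["a0", "a1"], "AnswerBText": ["b0", "b1"],
    "AnswerCText": ["c0", "c1"], "AnswerDText": ["d0", "d1"],
    "MisconceptionAId": [None, 10.0], "MisconceptionBId": [5.0, 11.0],
    "MisconceptionCId": [None, 12.0], "MisconceptionDId": [7.0, None],
})

# Build one block per answer letter, then stack the four blocks.
parts = []
for letter in "ABCD":
    part = toy[["QuestionId", "CorrectAnswer", "QuestionText"]].copy()
    part["Answer"] = letter
    part["AnswerText"] = toy[f"Answer{letter}Text"]
    part["MisconceptionId"] = toy[f"Misconception{letter}Id"]
    parts.append(part)
long_df = pd.concat(parts, ignore_index=True)

# Keep distractors only, then drop those without a misconception label.
long_df = long_df[long_df["Answer"] != long_df["CorrectAnswer"]]
long_df = long_df.dropna(subset=["MisconceptionId"])
long_df = long_df.sort_values(["QuestionId", "Answer"]).reset_index(drop=True)

print(long_df[["QuestionId", "Answer", "AnswerText", "MisconceptionId"]])
```

The result has one row per labelled distractor, like the output of the loop-based function followed by `dropna`.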


## Modeling <a name="modeling"></a>
In this section, we build and evaluate a model that predicts the misconception behind each distractor from the concatenated question and answer text. We treat the task as single-label text classification over the misconception IDs, fit a BERT classifier to the training data, and assess it on a held-out validation set.

To initiate our modeling process, we first load a pre-trained BERT tokenizer and model and save them to a local directory. These components are essential for processing the textual data and transforming it into a format suitable for the model. The cell below shows this one-time step; it is commented out here.

In [25]:
# # Load and save locally
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=miss_df['MisconceptionId'].nunique())

# # Save locally
# tokenizer.save_pretrained('./local_bert')
# model.save_pretrained('./local_bert')

Next, we take the question-answer texts as inputs and their misconception IDs as labels. We split this data into training and validation sets so that we can evaluate the model's performance on unseen data.

In [None]:
# Load the data
X = train_pred_data['QuestionAnswer']
y = train_pred_data['MisconceptionId'].astype(int)  # One misconception ID per row (a multi-label approach is also possible)

# Split into training and validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
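
The cell above passes the raw misconception IDs to the model as class labels. PyTorch's cross-entropy loss expects targets in the range `0 .. num_labels - 1`, so this only works if the IDs are already contiguous and zero-based, which the data shown here does not confirm. A minimal sketch of an explicit mapping, using made-up IDs and hypothetical names (`id_to_index`, `index_to_id`), could look like this:

```python
# Hypothetical misconception IDs as they might appear in y (values are made up).
raw_ids = [1672, 2142, 143, 2142, 1287]

# Build a stable mapping from misconception ID to class index, and its inverse.
classes = sorted(set(raw_ids))
id_to_index = {mid: i for i, mid in enumerate(classes)}
index_to_id = {i: mid for mid, i in id_to_index.items()}

labels = [id_to_index[mid] for mid in raw_ids]
print(labels)  # [2, 3, 0, 3, 1]

# After prediction, class indices are translated back to misconception IDs.
recovered = [index_to_id[i] for i in labels]
print(recovered == raw_ids)  # True
```

With such a mapping, the inverse dictionary would also be needed when writing predictions to the submission file.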

After that, we load the BERT tokenizer and model from our local directory to prepare for the tokenization of our data. Tokenization is a crucial step that converts the raw text into numerical representations that the model can understand.

In [None]:
# Load the BERT tokenizer and model from local directory
tokenizer = BertTokenizer.from_pretrained('./local_bert')
model = BertForSequenceClassification.from_pretrained('./local_bert', num_labels=miss_df['MisconceptionId'].nunique())

We then proceed to tokenize our training and validation datasets, applying truncation and padding to ensure uniform input sizes.

In [None]:
# Tokenization of the data
train_encodings = tokenizer(list(X_train), truncation=True, padding=True, max_length=128)
val_encodings = tokenizer(list(X_val), truncation=True, padding=True, max_length=128)

To facilitate the training process, we create a custom dataset class that transforms our tokenized data into PyTorch tensors, which are necessary for model training.

In [None]:
# Create PyTorch tensors
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels.to_numpy()  # Convert to NumPy array

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])  # Use idx directly
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = CustomDataset(train_encodings, y_train)
val_dataset = CustomDataset(val_encodings, y_val)

Next, we set up the training configurations, specifying parameters such as batch size, number of epochs, and logging options. These configurations will guide the training process.

In [None]:
# Training configurations
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=True,
    num_train_epochs=12,
    gradient_accumulation_steps=2,  # To simulate a larger batch size
    logging_dir='./logs',
    report_to=["none"],
)
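
To make the effect of these settings concrete, the sketch below works out the effective batch size and the number of optimizer updates. It assumes a single device and that every non-missing misconception ID from the exploration step becomes one training example (4 × 1869 slots minus 3106 missing values); both are assumptions rather than facts stated in this notebook.

```python
import math

n_examples = 4 * 1869 - (734 + 751 + 789 + 832)  # labelled distractors from the EDA counts
n_val = math.ceil(n_examples / 5)                # test_size=0.2 rounds the validation share up
n_train = n_examples - n_val

per_device_batch = 32
grad_accum = 2
n_devices = 1        # assumption: training runs on a single GPU
num_epochs = 12

effective_batch = per_device_batch * grad_accum * n_devices
batches_per_epoch = math.ceil(n_train / (per_device_batch * n_devices))
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
total_updates = updates_per_epoch * num_epochs

print(f"train/val examples: {n_train}/{n_val}")
print(f"effective batch size: {effective_batch}")
print(f"optimizer updates: {updates_per_epoch} per epoch, {total_updates} in total")
```

Under these assumptions the run performs a few hundred updates in total, which would be consistent with a training log that contains a single entry at step 500 if the logging interval is left at its default.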

Finally, we initialize the Trainer class with our model and training arguments, and we proceed to train the model. After training, we evaluate its performance on the validation set.

In [26]:
# Training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

trainer.train()

# Evaluation
trainer.evaluate()

Step,Training Loss
500,6.7176


{'eval_loss': 6.938859939575195,
 'eval_runtime': 170.062,
 'eval_samples_per_second': 5.139,
 'eval_steps_per_second': 0.165,
 'epoch': 12.0}
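
One rough way to read the evaluation loss is to compare it with the loss of a model that guesses uniformly. For cross-entropy, a uniform guess over N classes has loss ln N, so exp(loss) gives the number of equally likely classes that would produce the same loss. The sketch below applies this to the reported value; the helper `uniform_loss` is illustrative, and the actual number of misconception classes is not shown in this notebook.

```python
import math

eval_loss = 6.938859939575195  # copied from the evaluation output above

# exp(loss) is the perplexity: the number of equally likely classes for which
# a uniform guess would reach the same average cross-entropy.
perplexity = math.exp(eval_loss)
print(f"perplexity ≈ {perplexity:.0f}")

def uniform_loss(n_classes: int) -> float:
    """Cross-entropy of a uniform guess over n_classes labels."""
    return math.log(n_classes)

# Compare against the loss of uniform guessing for a few class counts.
for n in (500, 1000, 2000):
    print(n, round(uniform_loss(n), 3))
```

Comparing `eval_loss` with `uniform_loss(miss_df['MisconceptionId'].nunique())` would show how far the model has moved from chance.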

By executing the above steps, we fine-tune a BERT classifier that assigns a misconception ID to each question-answer pair. After 12 epochs the evaluation loss is 6.94, close to the training loss of 6.72 logged at step 500, so the model should be treated as a baseline rather than a finished solution.

## Model Evaluation <a name="model-evaluation"></a>
After training, we use the model to rank candidate misconceptions for each question-answer pair in the test set. Because the competition is scored with MAP@25, we keep the 25 misconceptions with the highest predicted probability for each pair. This section generates those predictions and writes them to a submission file.

### Evaluation Steps
- Extract Predictions: We will first extract the question-answer pairs from the test dataset and create a unique identifier for each question-answer combination.
- Tokenization: The test dataset will be tokenized using the same BERT tokenizer that we used for training, ensuring consistency in input format.
- Custom Dataset Class: We will define a custom dataset class for the test data to facilitate batch processing during evaluation.
- DataLoader: A DataLoader will be created for the test dataset to allow for easy iteration through the data in batches.
- Make Predictions: We will use the trained model to make predictions on the test dataset, retrieving the top predictions for each input.
- Format Predictions: Finally, we will format the predictions for submission and save them in a CSV file.

Below is the code that implements these evaluation steps:

In [27]:
# Extract question-answer pairs from the test dataset
question_answers = test_pred_data['QuestionAnswer']

# Create 'question_id' by combining 'QuestionId' and 'Answer'
test_pred_data['question_id'] = test_pred_data['QuestionId'].astype(str) + '_' + test_pred_data['Answer'].astype(str)

# Tokenize the test dataset with the same tokenizer that was used for training
tokenizer = BertTokenizer.from_pretrained('./local_bert')
test_encodings = tokenizer(list(question_answers), truncation=True, padding=True, max_length=128)

# Dataset class for the test data (no labels); named TestDataset so it does not
# shadow the CustomDataset class used for training
class TestDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        # Convert each encoding to a tensor
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        # Return the number of input samples
        return len(self.encodings['input_ids'])

# Create DataLoader for the test dataset
test_dataset = TestDataset(test_encodings)
test_loader = DataLoader(test_dataset, batch_size=8)

# Make predictions
model.eval()  # Switch off dropout for inference
device = model.device  # The Trainer may have moved the model to a GPU
predictions = []
question_ids = []  # Store corresponding question IDs
with torch.no_grad():  # Disable gradient calculation during evaluation
    for batch_idx, batch in enumerate(test_loader):
        # Move the batch to the same device as the model
        batch = {key: val.to(device) for key, val in batch.items()}
        outputs = model(**batch)
        logits = outputs.logits
        # Get probabilities using softmax
        probs = torch.softmax(logits, dim=1)
        # Retrieve indices of the top 25 predictions
        top_k = torch.topk(probs, k=25, dim=1).indices  # shape: (batch_size, 25)
        predictions.append(top_k.cpu())  # Back to CPU for NumPy conversion

        # Collect corresponding question IDs for each batch
        start_idx = batch_idx * test_loader.batch_size
        end_idx = start_idx + len(batch['input_ids'])
        question_ids.extend(test_pred_data['question_id'].iloc[start_idx:end_idx])

# Format the predictions for submission
final_predictions = []
for batch_preds in predictions:
    for pred in batch_preds:
        # Append the top 25 misconception IDs as a space-separated string
        final_predictions.append(' '.join(map(str, pred.numpy())))

# Create a DataFrame for the submission file
submission_df = pd.DataFrame({
    'QuestionId_Answer': question_ids,
    'MisconceptionId': final_predictions
})

# Save the DataFrame as a CSV file
submission_df.to_csv('submission.csv', index=False)

By following the steps outlined above, we produce a submission file containing the top 25 predicted misconceptions for each question-answer pair. Note that this cell only generates predictions: the test frame built here carries no misconception labels, so the model's quality has to be measured on the validation split or through the competition score.
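
The notebook names MAP@25 as its metric but does not compute it. The following is a minimal sketch of how it could be calculated on the validation split, assuming each question-answer pair has exactly one true misconception (as in the training frame built earlier). With a single relevant item, the average precision of a ranked list reduces to 1/rank of the true label, or 0 if it is not among the first k predictions. The function names and the toy data are illustrative.

```python
def average_precision_at_k(actual, predicted, k=25):
    """AP@k for one row that has a single correct label."""
    for rank, pred in enumerate(predicted[:k], start=1):
        if pred == actual:
            return 1.0 / rank
    return 0.0

def map_at_k(actuals, predictions, k=25):
    """Mean of AP@k over all rows."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: true label ranked first, second, and not at all.
actuals = [1672, 143, 801]
predictions = [[1672, 5, 9], [7, 143, 2], [1, 2, 3]]
print(map_at_k(actuals, predictions))  # (1 + 0.5 + 0) / 3 = 0.5
```

Applied to the validation set, `predictions` would hold the top 25 class indices per row and `actuals` the corresponding labels from `y_val`.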

## Conclusion and Next Steps <a name="conclusion-and-next-steps"></a>
In this project, we fine-tuned a BERT sequence classifier to predict the misconception behind each distractor in a set of diagnostic mathematics questions. After 12 epochs the evaluation loss was 6.94, close to the training loss of 6.72 logged at step 500, which indicates that the model learned only a limited amount from the question and answer text. MAP@25, the competition metric, was not computed on the validation set in this notebook.

### Recommendations:
- Measure MAP@25 Locally: Compute the competition metric on the validation split before submitting, since the loss alone does not show how well the top 25 predictions rank the true misconception.
- Check the Label Encoding: The class indices predicted by the model are written to the submission as misconception IDs, so the two must coincide.

### Future Work:
- New Misconceptions: A classifier can only predict misconceptions it has seen during training. Approaches that compare each distractor with the entries of `misconception_mapping.csv` may adapt better to new, emerging misconceptions.
- Richer Inputs: Explore adding fields such as `ConstructName` and `SubjectName` to the model input, and revisit the maximum sequence length of 128 tokens.

By continuing to refine our models and approaches, we can make misconception tagging more consistent for educators and more useful for students.