# <center><span style="color: #00bfff; font-family: Arial, sans-serif;">Welcome to my sanctuary</span></center>
<div style="background-color: #f0e68c; padding: 20px; border-radius: 15px;">
<h2 style="color: #00008b; font-family: Arial, sans-serif;">Thank you for exploring my notebook. Feel free to customize or fork it according to your requirements. Your feedback and support are highly valued. Stay tuned for the next kernels.</h2>
</div>


# [Learning Agency Lab Automated Essay Scoring 2.0 Competition](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2)
![Automatic Essay Checking with Machine Learning](https://th.bing.com/th/id/OIG1.hauKIGeSwYx8ZtIiz1Fo?pid=ImgGn)


### Introduction
The Automated Essay Scoring (AES) competition aims to develop advanced algorithms for scoring student essays automatically. By automating this process, we can alleviate the burden on teachers and provide timely feedback to students, especially in underserved communities.

### Background
The first AES competition took place twelve years ago, marking a significant milestone in the field. Since then, there have been notable advancements in technology and methodologies. However, challenges such as dataset limitations and algorithmic bias persist.

### Objectives
1. **Dataset Enhancement**: Prior competitions faced limitations due to small, non-diverse datasets. The current competition addresses this by providing a large, diverse dataset aligned with modern educational standards.
  
2. **Algorithm Improvement**: The goal is to surpass the performance of previous competitions, such as the Automated Student Assessment Prize (ASAP) competition held in 2012. This entails developing open-source scoring algorithms that are more accurate and efficient.

### Competition Structure
- **Host**: Vanderbilt University in collaboration with The Learning Agency Lab.
  
- **Timeline**: The competition runs from April 2, 2024, to July 2, 2024.
  
- **Evaluation Metric**: Submissions are scored based on the quadratic weighted kappa, which measures agreement between predicted and actual scores. The higher the kappa, the better the agreement.
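
To make the evaluation metric concrete, here is a minimal pure-Python sketch of quadratic weighted kappa. It assumes integer scores from 1 to 6 (the `min_rating` and `max_rating` defaults are assumptions, not taken from this notebook), and the function name is illustrative. The same quantity can be computed with `sklearn.metrics.cohen_kappa_score(y_true, y_pred, weights='quadratic')`.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=6):
    """Quadratic weighted kappa between two lists of integer ratings."""
    n_ratings = max_rating - min_rating + 1
    n = len(y_true)

    # Observed agreement matrix: rows are true ratings, columns are predictions
    observed = [[0] * n_ratings for _ in range(n_ratings)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    # Marginal histograms of true and predicted ratings
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_ratings))
                 for j in range(n_ratings)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_ratings):
        for j in range(n_ratings):
            # Penalty grows with the squared distance between ratings
            weight = (i - j) ** 2 / (n_ratings - 1) ** 2
            # Count expected if true and predicted ratings were independent
            expected = hist_true[i] * hist_pred[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # A zero denominator means both lists hold one identical rating throughout
    if denominator == 0:
        return 1.0
    return 1.0 - numerator / denominator


y_true = [1, 2, 3, 4, 5, 6]
print(quadratic_weighted_kappa(y_true, y_true))                       # perfect agreement
print(quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], max_rating=3))  # reversed ranking
```

Because the penalty is quadratic, predicting 2 for an essay scored 6 costs far more than predicting 5, which is why near-misses hurt the metric much less than large errors.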

### Submission Requirements
- **Code Competition**: Submissions must be made through Notebooks. 
- **Submission Format**: Each submission file must contain essay IDs and their corresponding predicted scores.
- **Prizes**: Leaderboard prizes are awarded based on performance, while efficiency prizes focus on both runtime and predictive performance.
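
As a rough illustration of the submission format, the sketch below checks that a CSV has exactly the columns `essay_id` and `score` (the names used later in this notebook), unique IDs, and integer scores. The helper name, the sample IDs, and the 1 to 6 score range are assumptions made for the example.

```python
import csv
import io


def check_submission(csv_text, valid_scores=range(1, 7)):
    """Return True if csv_text looks like a well-formed submission."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["essay_id", "score"]:
        return False

    seen_ids = set()
    for row in reader:
        essay_id = row["essay_id"]
        # Every essay needs a non-empty, unique ID
        if not essay_id or essay_id in seen_ids:
            return False
        seen_ids.add(essay_id)
        # Scores must parse as integers in the assumed range
        try:
            score = int(row["score"])
        except (TypeError, ValueError):
            return False
        if score not in valid_scores:
            return False

    # A header without any rows is not a usable submission
    return len(seen_ids) > 0


good = "essay_id,score\nabc123,3\ndef456,5\n"   # hypothetical IDs
bad = "essay_id,score\nabc123,7\n"              # score outside assumed range
print(check_submission(good), check_submission(bad))
```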
  
### Impact and Acknowledgments
- **Impact**: Automated essay scoring can lead to more accessible educational tools and support for both students and educators.
  
- **Acknowledgments**: The competition organizers thank supporters like the Bill & Melinda Gates Foundation, Schmidt Futures, and Chan Zuckerberg Initiative for their contributions.

> ### Notebook Objective:
This notebook focuses on utilizing simple ML libraries, FLAML for automated machine learning, and TF-IDF for text feature extraction. The goal is to efficiently conduct Exploratory Data Analysis (EDA) and build predictive models.

### Methodology:
We'll start by exploring the dataset with polars for a quick EDA. Then TF-IDF will be used to extract numerical features from the essay text. Finally, FLAML will be employed for automated hyperparameter tuning of a LightGBM model trained on those features.

### Conclusion:
By integrating these techniques, we aim to streamline the data analysis process and develop accurate predictive models effectively.

---

> # 1. Libraries

In [None]:
# Import subprocess to run pip, and sys to locate the current interpreter
import subprocess
import sys

try:
    # Install FLAML into the environment this notebook is running in
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'flaml'], check=True)
    print("FLAML installed successfully!")
except subprocess.CalledProcessError as e:
    # If pip exits with a non-zero status, print the error message
    print(f"Error installing FLAML: {e}")


In [None]:
#!pip install flaml
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time,os
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import lightgbm as lgb
from flaml import AutoML
import polars as pl
import nltk
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings('ignore')


> # 2. Importing Dataset

In [None]:
# Read train and test datasets
train = pl.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv')
test = pl.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/test.csv')
submission = pl.read_csv('/kaggle/input/learning-agency-lab-automated-essay-scoring-2/sample_submission.csv')


## `Head of dataset`

In [None]:
# Display the first few rows of each dataset
print("train:")
print(train.head())
print("\ntest :")
print(test.head())


## `Length`

In [None]:
print(len(train))
print(len(test))

## `Summary`

In [None]:
# Summary statistics for numerical columns
print("\nSummary statistics for train dataset:")
print(train.describe())


In [None]:
print("Summary statistics for test dataset :")
print(test.describe())

> # 4. Vectorization and LGBM
- This section uses TF-IDF vectorization to convert the essay text into numerical features. It then uses FLAML's AutoML to search for good hyperparameters for LightGBM, a gradient boosting algorithm, and trains a classifier that treats each essay score as a class.
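
To show what the vectorization step does, here is a simplified pure-Python sketch of TF-IDF with a `min_df` filter. It is not scikit-learn's implementation; it assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, L2 row normalization, and lowercase tokens of two or more word characters, which mirror the `TfidfVectorizer` defaults. The function name and toy corpus are illustrative.

```python
import math
import re
from collections import Counter


def tfidf_sketch(docs, min_df=0.05):
    """Return (vocabulary, rows) where each row maps term -> TF-IDF weight."""
    # Lowercase and keep tokens with at least two word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # A float min_df is a proportion: drop terms seen in too few documents
    vocab = sorted(t for t, count in df.items() if count >= min_df * n_docs)

    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in idf)
        weights = {t: counts[t] * idf[t] for t in counts}
        # L2-normalize each document so essay length does not dominate
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocab, rows


docs = ["the essay is good", "the essay is long", "the cat"]
# min_df=0.5 keeps only terms that appear in at least half of the documents
vocab, rows = tfidf_sketch(docs, min_df=0.5)
print(vocab)  # ['essay', 'is', 'the']
```

With `min_df=.05`, as used in this notebook, a word must appear in at least 5% of the essays to become a feature, which keeps the feature matrix small and drops rare words and misspellings.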

In [None]:
%%time
# This code is adapted from this notebook: https://www.kaggle.com/code/davidjlochner/base-tfidf-lgbm/notebook
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(min_df=.05)
train_tfid = vectorizer.fit_transform(train['full_text'])
test_tfid = vectorizer.transform(test['full_text'])

train_y = np.array(train['score'])

# Initialize AutoML for hyperparameter optimization
aml = AutoML()

# Fit AutoML to find the best hyperparameters
aml.fit(train_tfid, train_y, estimator_list=['lgbm'], task='classification', metric='macro_f1', time_budget=600)

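# Retrieve the best hyperparameters found by AutoML
best_config = dict(aml.best_config)

# FLAML tunes 'log_max_bin', which LightGBM does not recognize; convert it to 'max_bin'
if 'log_max_bin' in best_config:
    best_config['max_bin'] = (1 << best_config.pop('log_max_bin')) - 1

# Initialize LGBMClassifier with the best hyperparameters found
model = lgb.LGBMClassifier(**best_config)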

# Train the model
model.fit(train_tfid, train_y)

> # 5. Submission

In [None]:
# Predict scores for test data using the model
submission = test.select('essay_id').with_columns(score=model.predict(test_tfid))

# Display the submission data
display(submission)

# Write the submission to a CSV file
submission.write_csv('submission.csv')

# Confirm that the submission file was written
print("Submission generated successfully.")


---

## ``In summary, continuous refinement of model parameters and exploration of advanced algorithms are pivotal for enhancing accuracy. Collaboration and knowledge-sharing within the community offer valuable insights for achieving breakthroughs. With each iteration, we propel forward, contributing to the evolution of machine learning capabilities.``