# Word Difficulty Classifier
This project classifies and ranks words by difficulty, using both each word and its definition. It uses the `FlexibleScorer` class to score words against a specified criterion (here, difficulty) and visualizes the results as a horizontal bar chart with matplotlib.

## Features
- Scoring Words by Difficulty: Each word is assigned a difficulty score using the FlexibleScorer.
- Incorporates Definitions: The classifier considers both the word and its definition when calculating difficulty.
- Visualization: Generates a horizontal bar chart showing the top 30 words sorted by difficulty.
- Output: Saves the results as a CSV file and the bar chart as an image file.

## Installation
To run this project locally, you need Python 3 and the following dependencies:

```
pip install pandas matplotlib flexible-scorer
```

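The `flexible-scorer` package provides the `FlexibleScorer` class used below. It depends on the `openai` client, so the `OPENAI_API_KEY` environment variable must be set before running.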
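
A small preflight check can confirm the environment is ready before the notebook is run. The sketch below is illustrative only: the function name `missing_requirements` is hypothetical, and it assumes the package imports as `flexible_scorer` and that the key is read from `OPENAI_API_KEY`, as the notebook cells in this document suggest.

```python
import importlib.util
import os


def missing_requirements(
    modules=("pandas", "matplotlib", "flexible_scorer"),
    env_vars=("OPENAI_API_KEY",),
):
    """Return names of modules that cannot be found and env vars that are unset."""
    # find_spec returns None for a top-level module that is not installed.
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    # An empty value is treated the same as an unset variable.
    missing += [name for name in env_vars if not os.environ.get(name)]
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    if problems:
        print("Missing before running the notebook:", ", ".join(problems))
    else:
        print("All requirements found.")
```

An empty list means every listed module can be located and every listed variable has a value; it does not verify that the key itself is valid.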

## Sample Output Files
- CSV File: The script writes `sorted_words_by_difficulty_with_definitions.csv`, containing each word and its difficulty score, sorted from most to least difficult.
- Plot Image: The script also writes `word_difficulty_plot_with_definitions.png`, a bar chart of the top 30 words by difficulty.

# Code
## Import Dependencies

In [2]:
!pip install pandas matplotlib
!pip install flexible-scorer

Collecting flexible-scorer
  Downloading flexible_scorer-0.1.14-py3-none-any.whl.metadata (4.1 kB)
Collecting openai (from flexible-scorer)
  Downloading openai-1.51.1-py3-none-any.whl.metadata (24 kB)
Collecting httpx<1,>=0.23.0 (from openai->flexible-scorer)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai->flexible-scorer)
  Downloading jiter-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai->flexible-scorer)
  Downloading httpcore-1.0.6-py3-none-any.whl.metadata (21 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai->flexible-scorer)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading flexible_scorer-0.1.14-py3-none-any.whl (5.8 kB)
Downloading openai-1.51.1-py3-none-any.whl (383 kB)

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import os

# Read the API key from the environment; never hardcode it in the notebook.
if not os.environ.get('OPENAI_API_KEY'):
    raise RuntimeError('Set the OPENAI_API_KEY environment variable before running.')

from flexible_scorer import FlexibleScorer

## Score and Create DataFrame

In [None]:
csv_url = 'https://raw.githubusercontent.com/winston0753/vocab_difficulty/main/complete_word_list.csv'

# Specify the criteria for scoring (e.g., "difficulty")
criteria = "difficulty"
scorer = FlexibleScorer(criteria)

# Load the words CSV file
words_df = pd.read_csv(csv_url)

# Extract words and definitions
words = words_df['word']
definitions = words_df['definition']

# Score the words based on the word and its definition
scores = [(word, scorer.score(word, definition)) for word, definition in zip(words, definitions)]

# Sort the words by difficulty score (highest to lowest)
scores_sorted = sorted(scores, key=lambda x: x[1], reverse=True)

# Create a DataFrame from the sorted scores
df = pd.DataFrame(scores_sorted, columns=['Word', 'Difficulty'])

Streaming output truncated to the last 5000 lines.
  Category 2: -10.588911
  Category 3: -5.213911
  Category 4: -2.713911
  Category 5: -0.213911
  Category 6: -2.213911
  Category 7: -4.463911
  Category 8: -9.963911
  Category 9: -18.088911
  Category 10: -22.151411

Normalized probabilities:
  Category 1: 7.458505e-09
  Category 2: 2.519384e-05
  Category 3: 5.440357e-03
  Category 4: 6.627711e-02
  Category 5: 8.074205e-01
  Category 6: 1.092725e-01
  Category 7: 1.151724e-02
  Category 8: 4.706830e-05
  Category 9: 1.393432e-08
  Category 10: 2.397532e-10

Weighted score: 5.055215
Final normalized score: 0.450579

Detailed logprobs for each token:

Token: 5
  5: -0.4698
    Mapped to category: 5
  6: -1.4698
    Mapped to category: 6
  4: -2.4698
    Mapped to category: 4
  7: -3.0948
    Mapped to category: 7
  3: -4.2198
    Mapped to category: 3
  8: -8.4698
    Mapped to category: 8
  2: -9.4698
    Mapped to category: 2
  9: -16.2198
    Mapped to category: 9
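
The logged values above are consistent with a simple expectation over the ten categories: weighting each category index by its normalized probability gives the weighted score, and rescaling that from the 1-10 range onto 0-1 gives the final normalized score. The sketch below reproduces that arithmetic from the probabilities printed in the log. It is an inference from the printed numbers, not a description of how `FlexibleScorer` is implemented, and the helper names `weighted_score` and `rescale` are hypothetical.

```python
# Normalized probabilities copied from the log output above (categories 1-10).
probabilities = {
    1: 7.458505e-09,
    2: 2.519384e-05,
    3: 5.440357e-03,
    4: 6.627711e-02,
    5: 8.074205e-01,
    6: 1.092725e-01,
    7: 1.151724e-02,
    8: 4.706830e-05,
    9: 1.393432e-08,
    10: 2.397532e-10,
}


def weighted_score(probs):
    """Expected category index under the given probability distribution."""
    return sum(category * p for category, p in probs.items())


def rescale(score, low=1, high=10):
    """Map a score on [low, high] linearly onto [0, 1] (assumed normalization)."""
    return (score - low) / (high - low)


weighted = weighted_score(probabilities)
final = rescale(weighted)
print(f"Weighted score: {weighted:.6f}")
print(f"Final normalized score: {final:.6f}")
```

Under these assumptions the two printed values agree with the `Weighted score` and `Final normalized score` lines in the log to six decimal places.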


## Plot the Top 30 Words by Difficulty

In [None]:
plt.figure(figsize=(10, 6))
plt.barh(df['Word'][:30], df['Difficulty'][:30], color='skyblue')
plt.xlabel('Difficulty')
plt.ylabel('Word')
plt.title('Top 30 Words by Difficulty (With Definitions)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.savefig('word_difficulty_plot_with_definitions.png')
plt.show()

## Save the Sorted Scores to a CSV File


In [None]:
df.to_csv('sorted_words_by_difficulty_with_definitions.csv', index=False)
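
Because every score comes from an API-backed call, a single failure partway through the scoring list comprehension would discard all earlier results. One possible way to make the score-and-save step more forgiving is sketched below. It is not part of the original notebook: `score_and_save` and its `score_fn` parameter are hypothetical names, and `score_fn` stands in for whatever callable produces a numeric score (for example `scorer.score`).

```python
import csv
import math


def score_and_save(rows, score_fn, path):
    """Score (word, definition) pairs, skip failures, and write a sorted CSV.

    Returns the list of words that could not be scored.
    """
    scored, failed = [], []
    for word, definition in rows:
        try:
            score = float(score_fn(word, definition))
        except Exception:  # e.g. a network or API error for this one word
            failed.append(word)
            continue
        if math.isnan(score):
            failed.append(word)
            continue
        scored.append((word, score))

    # Highest difficulty first, matching the ordering used in the notebook.
    scored.sort(key=lambda pair: pair[1], reverse=True)

    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Word", "Difficulty"])
        writer.writerows(scored)
    return failed
```

With the notebook's variables this might be called as `score_and_save(zip(words, definitions), scorer.score, 'sorted_words_by_difficulty_with_definitions.csv')`, and the returned list shows which words to retry.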