# Project 2 AI Chat Notebook

In this notebook, we want to see the interactions with ChatGPT that you used to solve
the unsupervised AND supervised machine learning tasks in `project2.ipynb`.


# ChatGPT Functions to Use

You are free to use these functions, other LLMs, or the web version of ChatGPT, as long as you show those systems' outputs
so that we can see you used a large language model to perform the machine learning tasks.


In [5]:
from openai import OpenAI

client = OpenAI(
    # Custom OpenAI-compatible endpoint used instead of the library's default base URL
    base_url="http://aitools.cs.vt.edu:7860/openai/v1",
    api_key="aitools",
)


In [6]:

def converse(prompt, messages=None, model="gpt-3.5-turbo", max_tokens=2000,
             temperature=0, top_p=1, frequency_penalty=0, presence_penalty=0):
    """Send `prompt` to the chat model and return (reply, updated message history)."""
    # Start a new conversation when no history is supplied
    if messages is None:
        messages = []

    # Add the user's message to the list of messages
    messages.append({"role": "user", "content": prompt})

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        presence_penalty=presence_penalty,
    ).choices[0].message.content

    # Add the assistant's message to the list of messages
    messages.append({"role": "assistant", "content": response})

    return response, messages
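
How the `messages` argument controls conversation memory is worth a small illustration, because several cells below pass `[]` and therefore have to paste the earlier code back into the prompt. The sketch below is not the notebook's code: it swaps the real `client` for a hypothetical `fake_reply` stand-in so that it runs offline, and `converse_offline` is an assumed name that mirrors only the history handling of `converse`.

```python
def fake_reply(messages):
    """Hypothetical stand-in for the chat model: reports how many messages it was sent."""
    return f"I can see {len(messages)} message(s)"


def converse_offline(prompt, messages=None):
    """Mirror the history handling of `converse` without any network call."""
    if messages is None:
        messages = []
    messages.append({"role": "user", "content": prompt})
    response = fake_reply(messages)
    messages.append({"role": "assistant", "content": response})
    return response, messages


# Passing the returned history forward lets the model see the earlier turns.
first, history = converse_offline("Load the data", [])
second, history = converse_offline("Now cluster it", history)

# Passing a fresh list starts over: the model sees only the new prompt.
fresh, new_history = converse_offline("Now cluster it", [])

print(first)   # I can see 1 message(s)
print(second)  # I can see 3 message(s)
print(fresh)   # I can see 1 message(s)
```

With the real `converse`, the same pattern applies: reuse the returned `messages` to continue a conversation, or pass `[]` to start a new one and restate any context the model needs.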

# Unsupervised Learning

In this section, show us how you performed the unsupervised learning tasks with the help of ChatGPT.

In [7]:
import glob

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


In [8]:
prompt = """
- Load all data files matching 'data/datasets/train-00000-5k-*.parquet'.
- Merge those dataframes into a single dataframe "df".
"""
response, messages = converse(prompt, [])
print(response)

To load all data files matching the pattern 'data/datasets/train-00000-5k-*.parquet' and merge them into a single dataframe, you can use the following code:

```python
import pandas as pd
import glob

# Get a list of all file paths matching the pattern
file_paths = glob.glob('data/datasets/train-00000-5k-*.parquet')

# Load each file into a dataframe and store them in a list
dfs = [pd.read_parquet(file) for file in file_paths]

# Merge all dataframes into a single dataframe
df = pd.concat(dfs, ignore_index=True)
```

This code uses the `glob` module to find all file paths matching the pattern. It then loads each file into a dataframe using `pd.read_parquet()` and stores them in a list called `dfs`. Finally, it merges all dataframes in the `dfs` list into a single dataframe called `df` using `pd.concat()`. The `ignore_index=True` parameter ensures that the resulting dataframe has a continuous index.
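
One detail the explanation above leaves out: `glob.glob` returns matches in arbitrary, filesystem-dependent order, so the row order of `df` can differ between machines. A minimal sketch of one way to make the order deterministic, assuming that sorting the paths is acceptable; the temporary directory and the empty placeholder files are made up for the demonstration and stand in for the real parquet files.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create empty placeholder files (hypothetical names following the notebook's pattern)
    for suffix in ("c", "a", "b"):
        open(os.path.join(tmp_dir, f"train-00000-5k-{suffix}.parquet"), "w").close()

    pattern = os.path.join(tmp_dir, "train-00000-5k-*.parquet")
    # sorted() fixes the order regardless of how the filesystem lists the files
    file_paths = sorted(glob.glob(pattern))
    file_names = [os.path.basename(path) for path in file_paths]

print(file_names)
```

A fixed order also matters later in the notebook, where the position of a file in the list is passed to `learn` as its index.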


In [6]:
prompt = """
- Define the code snippets (from the dataframe's "code" attribute)
- Define labels (from the dataframe's "language" attribute)
"""
response, messages = converse(prompt, messages)
print(response)

To define the code snippets from the data frames' "code" attribute and the labels from the dataframe's "language" attribute, you can use the following code:

```python
# Define the code snippets
code_snippets = df['code'].tolist()

# Define the labels
labels = df['language'].tolist()
```

In this code, `df['code']` retrieves the values from the "code" column of the dataframe `df` and `tolist()` converts it into a list. This creates a list of code snippets stored in the variable `code_snippets`.

Similarly, `df['language']` retrieves the values from the "language" column of the dataframe `df` and `tolist()` converts it into a list. This creates a list of labels stored in the variable `labels`.

Now, you can use the `code_snippets` list and `labels` list for further analysis or modeling tasks.


In [7]:
prompt = """
- Convert code snippets to TF-IDF features
"""
response, messages = converse(prompt, messages)
print(response)

To convert code snippets to TF-IDF (Term Frequency-Inverse Document Frequency) features, you can use the `TfidfVectorizer` class from the `sklearn.feature_extraction.text` module. Here's an example code snippet:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert code snippets to TF-IDF features
vectorizer = TfidfVectorizer()
tfidf_features = vectorizer.fit_transform(code_snippets)
```

In this code, `TfidfVectorizer` is initialized without any additional parameters, which means it will use the default settings for tokenization, stop words, and n-gram range. You can customize these settings by passing appropriate parameters to the `TfidfVectorizer` constructor.

The `fit_transform()` method of the `TfidfVectorizer` class is then used to convert the code snippets (stored in the `code_snippets` list) into TF-IDF features. The resulting `tfidf_features` variable will be a sparse matrix representation of the TF-IDF features.

You can now use the `tfidf_feature
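
One default is worth knowing when the documents are source code: `TfidfVectorizer` lowercases the text and its documented default `token_pattern`, `(?u)\b\w\w+\b`, keeps only runs of two or more word characters, so operators, punctuation, and one-letter identifiers never become features. The sketch below reproduces just that regex with the standard library; `default_tokens` is a hypothetical helper, not part of scikit-learn, and the two snippets are invented examples.

```python
import re

# Default token pattern documented for scikit-learn's text vectorizers
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(snippet):
    """Approximate the default analyzer: lowercase, then apply the token pattern."""
    return TOKEN_PATTERN.findall(snippet.lower())


print(default_tokens("for (int i = 0; i < n; i++) { sum += x[i]; }"))
print(default_tokens("def add(a, b): return a + b"))
```

If symbols turn out to matter for telling languages apart, a custom `token_pattern` or `analyzer` can be passed to the `TfidfVectorizer` constructor.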

In [24]:
prompt = """
- # enter code fragment here
import pandas as pd
import glob

# Get a list of all file paths matching the pattern
file_paths = glob.glob('data/datasets/train-00000-5k-*.parquet')

# Load each file into a dataframe and store them in a list
dfs = [pd.read_parquet(file) for file in file_paths]

# Merge all dataframes into a single dataframe
df = pd.concat(dfs, ignore_index=True)

# Extract code snippets from the "code" attribute
code_snippets = df['code'].tolist()

# Extract labels from the "language" attribute
labels = df['language'].tolist()# enter code fragment here
- Use the code above for the task below.
- Perform K-Means clustering (use 5 clusters).
"""
response, messages = converse(prompt, [])
print(response)

To perform K-Means clustering with 5 clusters on the code snippets, you can use the following code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Convert code snippets into TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(code_snippets)

# Perform K-Means clustering
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)

# Get the cluster labels for each code snippet
cluster_labels = kmeans.labels_

# Print the cluster labels
print(cluster_labels)
```

This code first converts the code snippets into TF-IDF vectors using the `TfidfVectorizer` from scikit-learn. Then, it performs K-Means clustering with 5 clusters using the `KMeans` class. Finally, it assigns cluster labels to each code snippet and prints the cluster labels.
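
K-Means numbers its clusters arbitrarily, so the printed `cluster_labels` say nothing on their own about which language a cluster captured. One way to read them, sketched under the assumption that the known `labels` list is used only for inspection after clustering, is to tally the languages inside each cluster. The tiny `cluster_labels` and `labels` lists below are invented placeholders, and `languages_per_cluster` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def languages_per_cluster(cluster_labels, labels):
    """Count how often each language appears in each cluster."""
    tally = defaultdict(Counter)
    for cluster, language in zip(cluster_labels, labels):
        tally[int(cluster)][language] += 1
    return dict(tally)


# Invented placeholder data standing in for kmeans.labels_ and df['language']
cluster_labels = [0, 0, 1, 1, 1, 2]
labels = ["Python", "Python", "Java", "Java", "Go", "Go"]

for cluster, counts in sorted(languages_per_cluster(cluster_labels, labels).items()):
    language, count = counts.most_common(1)[0]
    print(f"Cluster {cluster}: mostly {language} ({count} of {sum(counts.values())})")
```

Because the initial centroids are chosen randomly, passing a fixed `random_state` to `KMeans` keeps the cluster numbering stable between runs.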


In [25]:
prompt = """
- Print the cluster assignments, with each snippet truncated to 50 characters.
  Print 6 examples from each cluster. Append "..." to the end of the line if the snippet is longer than 50 characters. Replace all internal newlines with spaces in each snippet.
"""
response, messages = converse(prompt, messages)
print(response)

To print the cluster assignments with each cluster being truncated to 50 characters long, you can modify the code as follows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Convert code snippets into TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(code_snippets)

# Perform K-Means clustering
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)

# Get the cluster labels for each code snippet
cluster_labels = kmeans.labels_

# Create a dictionary to store snippets for each cluster
cluster_snippets = {i: [] for i in range(5)}

# Assign each snippet to its respective cluster
for i, snippet in enumerate(code_snippets):
    cluster_snippets[cluster_labels[i]].append(snippet.replace('\n', ' '))

# Print 6 examples of each cluster
for cluster in range(5):
    print(f"Cluster {cluster}:")
    for i, snippet in enumerate(cluster_snippets[cluster][:6]):
        if len(snippet) > 50:
            snippet = snippet[:50] 

In [26]:
prompt = """
- Generate code to make a scatter plot with cluster centers (use matplotlib). Use a distinct color for each cluster.
  Plot only 20 observations for each cluster.
"""
response, messages = converse(prompt, messages)
print(response)

To generate a scatter plot with cluster centers using matplotlib, with a distinct color for each cluster and considering only 20 observations for each cluster, you can use the following code:

```python
import matplotlib.pyplot as plt

# Get the cluster centers
cluster_centers = kmeans.cluster_centers_

# Get the first 20 observations for each cluster
selected_indices = []
for cluster in range(5):
    cluster_indices = [i for i, label in enumerate(cluster_labels) if label == cluster]
    selected_indices.extend(cluster_indices[:20])

# Get the selected code snippets and their corresponding cluster labels
selected_snippets = [code_snippets[i] for i in selected_indices]
selected_labels = [cluster_labels[i] for i in selected_indices]

# Convert code snippets into TF-IDF vectors
X_selected = vectorizer.transform(selected_snippets)

# Perform dimensionality reduction for visualization
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_selected.toarra

# Supervised Learning
In this section, show us how you performed the supervised learning tasks with the help of ChatGPT.

In [30]:
prompt = """
- I am writing code for Supervised Learning, below are the initial steps.
- Load all data files matching 'data/datasets/train-00000-5k*.parquet'.
- For each file loaded, merge the latest data file with the merged data to date,
- and call the learn function with the dataframe and the index of the file in the list of files.
"""
response, messages = converse(prompt, [])
print(response)

Here is an example code that follows the initial steps you mentioned:

```python
import pandas as pd
import glob

def load_data_files():
    file_list = glob.glob('data/datasets/train-00000-5k*.parquet')
    merged_data = pd.DataFrame()
    
    for i, file in enumerate(file_list):
        data = pd.read_parquet(file)
        merged_data = pd.concat([merged_data, data])
        learn(merged_data, i)
        
def learn(dataframe, file_index):
    # Your learn function implementation goes here
    # This function will be called with the merged dataframe and the index of the file
    
    # Example implementation
    print(f"Learning from file {file_index}:")
    print(dataframe.head())

# Call the load_data_files function to start the process
load_data_files()
```

In this code, the `load_data_files` function loads all the data files matching the given pattern using the `glob` module. It then initializes an empty dataframe `merged_data` to store the merged data.

For each file loaded, it

In [31]:
prompt = """
- Add the instructions below to the learn function in the previous code:
- Define the code snippets (from the dataframe's "code" attribute)
- Define labels (from the dataframe's "language" attribute)
- Split the code snippets and labels into training and test sets
"""
response, messages = converse(prompt, messages)
print(response)

Sure! Here's an updated version of the code that includes the additional instructions in the `learn` function:

```python
import pandas as pd
import glob
from sklearn.model_selection import train_test_split

def load_data_files():
    file_list = glob.glob('data/datasets/train-00000-5k*.parquet')
    merged_data = pd.DataFrame()
    
    for i, file in enumerate(file_list):
        data = pd.read_parquet(file)
        merged_data = pd.concat([merged_data, data])
        learn(merged_data, i)
        
def learn(dataframe, file_index):
    # Extract code snippets and labels
    code_snippets = dataframe['code']
    labels = dataframe['language']
    
    # Split code snippets and labels into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(code_snippets, labels, test_size=0.2, random_state=42)
    
    # Your further learning algorithm implementation goes here
    # You can use X_train, X_test, y_train, y_test for training and testing
    
    # Example impl

In [32]:
prompt = """
- Create a TF-IDF vectorizer
- Use the 'fit_transform' method on the training data to learn the vocabulary and idf, and return term-document matrix.
- Use the 'transform' method on the test data to transform documents to document-term matrix.
"""
response, messages = converse(prompt, messages)
print(response)

To create a TF-IDF vectorizer and use it to transform the training and test data, you can modify the `learn` function as follows:

```python
import pandas as pd
import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

def load_data_files():
    file_list = glob.glob('data/datasets/train-00000-5k*.parquet')
    merged_data = pd.DataFrame()
    
    for i, file in enumerate(file_list):
        data = pd.read_parquet(file)
        merged_data = pd.concat([merged_data, data])
        learn(merged_data, i)
        
def learn(dataframe, file_index):
    # Extract code snippets and labels
    code_snippets = dataframe['code']
    labels = dataframe['language']
    
    # Split code snippets and labels into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(code_snippets, labels, test_size=0.2, random_state=42)
    
    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    
    # Learn 

In [33]:
prompt = """
- Create a Support Vector Machine classifier
- Train the classifier using the training data
- Save the model to a file and load it back from a file (to make sure it works)
"""
response, messages = converse(prompt, messages)
print(response)

To create a Support Vector Machine (SVM) classifier, train it using the training data, and save/load the model, you can modify the `learn` function as follows:

```python
import pandas as pd
import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
import joblib

def load_data_files():
    file_list = glob.glob('data/datasets/train-00000-5k*.parquet')
    merged_data = pd.DataFrame()
    
    for i, file in enumerate(file_list):
        data = pd.read_parquet(file)
        merged_data = pd.concat([merged_data, data])
        learn(merged_data, i)
        
def learn(dataframe, file_index):
    # Extract code snippets and labels
    code_snippets = dataframe['code']
    labels = dataframe['language']
    
    # Split code snippets and labels into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(code_snippets, labels, test_size=0.2, random_state=42)
    
    # Creat

In [42]:
prompt = """
-{import pandas as pd
import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
import joblib

def load_data_files():
    file_list = glob.glob('data/datasets/train-00000-5k*.parquet')
    merged_data = pd.DataFrame()
    
    for i, file in enumerate(file_list):
        data = pd.read_parquet(file)
        merged_data = pd.concat([merged_data, data])
        learn(merged_data, i)
        
def learn(dataframe, file_index):
    # Extract code snippets and labels
    code_snippets = dataframe['code']
    labels = dataframe['language']
    
    # Split code snippets and labels into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(code_snippets, labels, test_size=0.2, random_state=42)
    
    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    
    # Learn vocabulary and idf, and return term-document matrix for training data
    X_train_tfidf = vectorizer.fit_transform(X_train)
    
    # Transform test data to document-term matrix
    X_test_tfidf = vectorizer.transform(X_test)
    
    # Create SVM classifier
    svm_classifier = SVC()
    
    # Train the classifier using the training data
    svm_classifier.fit(X_train_tfidf, y_train)
    
    # Save the model to a file
    model_file = f"svm_model_{file_index}.joblib"
    joblib.dump(svm_classifier, model_file)
    
    # Load the model from the file
    loaded_model = joblib.load(model_file)
    
    # Your further learning algorithm implementation goes here
    # You can use loaded_model, X_test_tfidf, y_test for testing
    
    # Example implementation
    print(f"Learning from file {file_index}:")
    print("Training set:")
    print(X_train_tfidf.shape)
    print(y_train.head())
    print("Test set:")
    print(X_test_tfidf.shape)
    print(y_test.head())

# Call the load_data_files function to start the process
load_data_files()}
- add below instruction in learn function in the above code and return all code
- Use the classifier to predict the labels for the test data
"""
response, messages = converse(prompt, [])
print(response)

# Your further learning algorithm implementation goes here
# You can use loaded_model, X_test_tfidf, y_test for testing

# Example implementation
print(f"Learning from file {file_index}:")
print("Training set:")
print(X_train_tfidf.shape)
print(y_train.head())
print("Test set:")
print(X_test_tfidf.shape)
print(y_test.head())

# Use the classifier to predict the labels for the test data
y_pred = loaded_model.predict(X_test_tfidf)

# Print the predicted labels
print("Predicted labels:")
print(y_pred)

# Return all code
return merged_data, vectorizer, svm_classifier, X_train_tfidf, X_test_tfidf, y_train, y_test, y_pred


In [43]:
prompt = """
- Print the classification report which should be a dictionary
- Store the training time in the dictionary under 'train'
- Store the prediction time in the dictionary under 'predict'
- Store the classification report in the dictionary under 'report'
- Add 'accuracy' to the dictionary and set it to the accuracy score from the classification report
- Append the dictionary to the measurements list for each round
"""
response, messages = converse(prompt, messages)
print(response)

# Your further learning algorithm implementation goes here
# You can use loaded_model, X_test_tfidf, y_test for testing

# Example implementation
print(f"Learning from file {file_index}:")
print("Training set:")
print(X_train_tfidf.shape)
print(y_train.head())
print("Test set:")
print(X_test_tfidf.shape)
print(y_test.head())

# Use the classifier to predict the labels for the test data
start_time = time.time()
y_pred = loaded_model.predict(X_test_tfidf)
end_time = time.time()

# Calculate the training time
train_time = end_time - start_time

# Calculate the prediction time
start_time = time.time()
y_pred = loaded_model.predict(X_test_tfidf)
end_time = time.time()
predict_time = end_time - start_time

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)

# Calculate the classification report
classification_report_dict = classification_report(y_test, y_pred, output_dict=True)

# Create a dictionary to store the measurements
measurements = {
    'train': train_time,
  

In [44]:
prompt = """
- In the above code I am getting the error: 'name 'time' is not defined'
"""
response, messages = converse(prompt, messages)
print(response)

Apologies for the oversight. The error is occurring because the `time` module has not been imported. To resolve this issue, you need to add `import time` at the beginning of your code, along with the other import statements. Here's the updated code:

```python
import pandas as pd
import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
import joblib
import time

def load_data_files():
    file_list = glob.glob('data/datasets/train-00000-5k*.parquet')
    merged_data = pd.DataFrame()
    
    for i, file in enumerate(file_list):
        data = pd.read_parquet(file)
        merged_data = pd.concat([merged_data, data])
        learn(merged_data, i)
        
def learn(dataframe, file_index):
    # Extract code snippets and labels
    code_snippets = dataframe['code']
    labels = dataframe['language']
    
    # Split code snippets and labels into training and test sets
    X_train, X_test, y_tr

In [48]:
prompt = """
-{import pandas as pd
import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
import joblib
import time

def load_data_files():
    file_list = glob.glob('data/datasets/train-00000-5k*.parquet')
    merged_data = pd.DataFrame()
    measurements_list=[]
    for i, file in enumerate(file_list):
        
        data = pd.read_parquet(file)
        merged_data = pd.concat([merged_data, data])
        measurements= learn(merged_data, i)
        measurements_list.append(measurements)
    return measurements_list
        
def learn(dataframe, file_index):
    
    # Extract code snippets and labels
    code_snippets = dataframe['code']
    labels = dataframe['language']
    
    # Split code snippets and labels into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(code_snippets, labels, test_size=0.2, random_state=42)
    
    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    
    # Learn vocabulary and idf, and return term-document matrix for training data
    X_train_tfidf = vectorizer.fit_transform(X_train)
    
    # Transform test data to document-term matrix
    X_test_tfidf = vectorizer.transform(X_test)
    
    # Create SVM classifier
    svm_classifier = SVC()
    
    # Train the classifier using the training data
    start_time = time.time()
    svm_classifier.fit(X_train_tfidf, y_train)
    end_time = time.time()
    
    train_time = end_time - start_time
    
    # Save the model to a file
    model_file = f"svm_model_{file_index}.joblib"
    joblib.dump(svm_classifier, model_file)
    
    # Load the model from the file
    loaded_model = joblib.load(model_file)
    # Use the classifier to predict the labels for the test data
    start_time = time.time()
    y_pred = loaded_model.predict(X_test_tfidf)
    end_time = time.time()
    
    predict_time = end_time - start_time
    accuracy = accuracy_score(y_test, y_pred)
    
    # Calculate the classification report
    classification_report_dict = classification_report(y_test, y_pred, output_dict=True)
    
    # Create a dictionary to store the measurements
    measurements = {
        'train': train_time,
        'predict': predict_time,
        'report': classification_report_dict,
        'accuracy': accuracy
    }
    # Your further learning algorithm implementation goes here
    # You can use loaded_model, X_test_tfidf, y_test for testing
    
    # Example implementation
    print(f"Learning from file {file_index}:")
    print("Training set:")
    print(X_train_tfidf.shape)
    print(y_train.head())
    print("Test set:")
    print(X_test_tfidf.shape)
    print(y_test.head())
    # Print the predicted labels
    print("Predicted labels:")
    print(y_pred.head())
    return measurements

# Call the load_data_files function to start the process
measurements_list= load_data_files()
print(measurements_list)}
- In the above code I am getting the errors below:
- error: 'name 'accuracy_score' is not defined'
- error: 'name 'classification_report' is not defined'
"""
response, messages = converse(prompt, [])
print(response)

The error message indicates that the functions `accuracy_score` and `classification_report` are not defined. These functions are part of the `sklearn.metrics` module, so you need to import them in order to use them.

To fix the error, add the following import statements at the beginning of your code:

```python
from sklearn.metrics import accuracy_score, classification_report
```

This will import the necessary functions and resolve the error.
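
For the 'accuracy' entry requested in the measurements dictionary, it may help to see that the dictionary returned by `classification_report(..., output_dict=True)` already carries an `"accuracy"` key for ordinary single-label problems, and that it agrees with `accuracy_score`. A minimal sketch with invented labels, assuming scikit-learn is installed; `y_true` and `y_hat` are placeholders for `y_test` and `y_pred`.

```python
from sklearn.metrics import accuracy_score, classification_report

# Invented placeholder labels standing in for y_test and y_pred
y_true = ["Python", "Java", "Java", "Go"]
y_hat = ["Python", "Java", "Go", "Go"]

report = classification_report(y_true, y_hat, output_dict=True)
accuracy = accuracy_score(y_true, y_hat)

# The report holds one entry per class plus the summary rows
print(sorted(report.keys()))
print(accuracy, report["accuracy"])
```

Either value can be stored under 'accuracy'; reading it from the report avoids a second pass over the predictions.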


In [49]:
prompt = """
- In the above code I am getting the error ''numpy.ndarray' object has no attribute 'head'' for the line print(y_pred.head())
"""
response, messages = converse(prompt, messages)
print(response)

The error message `'numpy.ndarray' object has no attribute 'head'` suggests that you are trying to use the `head()` method on a NumPy array, which is not supported. The `head()` method is specific to pandas DataFrame and Series objects.

In your code, `y_pred` is a NumPy array containing the predicted labels. To print the first few elements of the array, you can use indexing or slicing. Here's an example:

```python
print("Predicted labels:")
print(y_pred[:5])  # Print the first 5 elements of the array
```

This will print the first 5 predicted labels from the `y_pred` array. Adjust the number `5` as needed to display the desired number of elements.


In [51]:
prompt = """
-{import pandas as pd
import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
import joblib
import time
from sklearn.metrics import accuracy_score, classification_report

def load_data_files():
    file_list = glob.glob('data/datasets/train-00000-5k*.parquet')
    merged_data = pd.DataFrame()
    measurements_list=[]
    for i, file in enumerate(file_list):
        
        data = pd.read_parquet(file)
        merged_data = pd.concat([merged_data, data])
        measurements= learn(merged_data, i)
        measurements_list.append(measurements)
    return measurements_list
        
def learn(dataframe, file_index):
    
    # Extract code snippets and labels
    code_snippets = dataframe['code']
    labels = dataframe['language']
    
    # Split code snippets and labels into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(code_snippets, labels, test_size=0.2, random_state=42)
    
    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer()
    
    # Learn vocabulary and idf, and return term-document matrix for training data
    X_train_tfidf = vectorizer.fit_transform(X_train)
    
    # Transform test data to document-term matrix
    X_test_tfidf = vectorizer.transform(X_test)
    
    # Create SVM classifier
    svm_classifier = SVC()
    
    # Train the classifier using the training data
    start_time = time.time()
    svm_classifier.fit(X_train_tfidf, y_train)
    end_time = time.time()
    
    train_time = end_time - start_time
    
    # Save the model to a file
    model_file = f"svm_model_{file_index}.joblib"
    joblib.dump(svm_classifier, model_file)
    
    # Load the model from the file
    loaded_model = joblib.load(model_file)
    # Use the classifier to predict the labels for the test data
    start_time = time.time()
    y_pred = loaded_model.predict(X_test_tfidf)
    end_time = time.time()
    
    predict_time = end_time - start_time
    accuracy = accuracy_score(y_test, y_pred)
    
    # Calculate the classification report
    classification_report_dict = classification_report(y_test, y_pred, output_dict=True)
    
    # Create a dictionary to store the measurements
    measurements = {
        'train': train_time,
        'predict': predict_time,
        'report': classification_report_dict,
        'accuracy': accuracy
    }
    # Your further learning algorithm implementation goes here
    # You can use loaded_model, X_test_tfidf, y_test for testing
    
    # Example implementation
    print(f"Learning from file {file_index}:")
    print("Training set:")
    print(X_train_tfidf.shape)
    print(y_train.head())
    print("Test set:")
    print(X_test_tfidf.shape)
    print(y_test.head())
    # Print the predicted labels
    print("Predicted labels:")
    # print(y_pred.head())
    print(y_pred[:5]) 
    return measurements

# Call the load_data_files function to start the process
measurements_list= load_data_files()
print(measurements_list)}
- Consider the above code and write a new function that takes the index, train, predict, and accuracy values from measurements_list.
- Plot lines on the same graph for tfidf, train, predict, and accuracy using Python. Use matplotlib.
- Add a legend.  Add axis labels.  Add a title.
- Save the plot to a file "supervised-plotter.png" before showing the plot.
"""
response, messages = converse(prompt, [])
print(response)

Here is the updated code that includes the new function and the plotting functionality:

```python
import pandas as pd
import glob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
import joblib
import time
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

def load_data_files():
    file_list = glob.glob('data/datasets/train-00000-5k*.parquet')
    merged_data = pd.DataFrame()
    measurements_list=[]
    for i, file in enumerate(file_list):
        
        data = pd.read_parquet(file)
        merged_data = pd.concat([merged_data, data])
        measurements= learn(merged_data, i)
        measurements_list.append(measurements)
    return measurements_list

def learn(dataframe, file_index):
    
    # Extract code snippets and labels
    code_snippets = dataframe['code']
    labels = dataframe['language']
    
    # Split code snippets and labels i