<a href="https://colab.research.google.com/github/tonykipkemboi/30days-Swahili/blob/master/gpt_4o_data_analyst.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Analysis with GPT-4o

This notebook demonstrates how to use GPT-4o to analyze a Pandas DataFrame.

We will:
1. Set up the environment by installing necessary libraries.
2. Load a CSV file into a Pandas DataFrame.
3. Use GPT-4o to generate analysis based on user questions.

## Step 1: Set Up the Environment

First, install the required libraries: `pandas` for data manipulation and `openai` for accessing GPT-4o.


In [44]:
# Install the required libraries
!pip install -q pandas openai

In [25]:
# Import necessary libraries and set up environment variables
import pandas as pd
from openai import OpenAI
from google.colab import userdata

# Set up OpenAI API key
client = OpenAI(
    api_key=userdata.get('OPENAI_API_KEY'),
)
print(client)

<openai.OpenAI object at 0x7e3f6eaba4d0>
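`userdata.get` only works inside a Colab runtime. As a minimal sketch, assuming you may also run this notebook locally, the key lookup can fall back to an environment variable. The helper name `get_openai_api_key` is hypothetical and not part of the original notebook.

```python
import os


def get_openai_api_key(name="OPENAI_API_KEY"):
    """Return the API key from Colab secrets, falling back to the environment."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(name)
            if key:
                return key
        except Exception:
            # Secret missing or notebook access not granted; try the environment
            pass

    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment")
    return key
```

With this helper, the client could be created as `OpenAI(api_key=get_openai_api_key())`.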


## Step 2: Define Functions for Analysis
We define the following functions:

1. `load_data`: loads data from a CSV file into a Pandas DataFrame.
2. `summarize_dataframe` and `sample_dataframe`: helpers that return summary statistics, column types, and a random sample of the DataFrame as text.
3. `ask_gpt4`: sends a question and the DataFrame to GPT-4o and returns the response.

In [8]:
# Function to load data into a DataFrame
def load_data(file_path):
    return pd.read_csv(file_path)

In [45]:
def summarize_dataframe(df):
    summary = df.describe().to_csv()
    column_info = df.dtypes.to_string()
    return summary, column_info

def sample_dataframe(df, n=5):
    sample = df.sample(n=n)
    return sample.to_csv(index=False)
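The two helpers above matter because the full CSV of a large DataFrame can make a very long prompt. The sketch below, run on synthetic stand-in data rather than the candy dataset, compares the character count of the full CSV with that of the summary plus a five-row sample. The column names and the 1,000-row size are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd


def summarize_dataframe(df):
    summary = df.describe().to_csv()
    column_info = df.dtypes.to_string()
    return summary, column_info


def sample_dataframe(df, n=5):
    return df.sample(n=n).to_csv(index=False)


# Synthetic stand-in data (not the candy dataset): 1,000 rows, two numeric columns
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "sugarpercent": rng.random(1000),
    "pricepercent": rng.random(1000),
})

summary, column_info = summarize_dataframe(demo)
sample = sample_dataframe(demo, n=5)

# Compare how much text each representation would add to a prompt
full_chars = len(demo.to_csv())
compact_chars = len(summary) + len(column_info) + len(sample)
print(f"full CSV: {full_chars} characters, summary + sample: {compact_chars} characters")
```

The trade-off is that the model no longer sees every row, so questions about specific records may need the full table.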

In [39]:
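# Function to ask GPT-4o for analysis
def ask_gpt4(question, dataframe, use_summary=False, sample_size=5):
    # Choose what to send: a compact summary plus a random sample,
    # or the full table serialized as CSV (the default)
    if use_summary:
        summary, column_info = summarize_dataframe(dataframe)
        sample = sample_dataframe(dataframe, n=min(sample_size, len(dataframe)))
        data_text = (
            f"Column types:\n{column_info}\n\n"
            f"Summary statistics:\n{summary}\n\n"
            f"Random sample:\n{sample}"
        )
    else:
        data_text = dataframe.to_csv()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful and expert data analyst.",
            },
            {
                "role": "user",
                # Real newlines here; the escaped "\\n" sent literal backslashes
                "content": f"Analyze the following data and answer the question:\n{question}\n\nData:\n{data_text}",
            },
        ],
        temperature=0,
    )
    return response.choices[0].message.content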

# for chunk in stream:
#   if chunk.choices[0].delta.content is not None:
#       print(chunk.choices[0].delta.content, end="")
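The commented-out loop above refers to a `stream` object that is never created. As a hedged sketch of what a streaming variant could look like, the function below passes `stream=True` to `client.chat.completions.create` and prints each text fragment as it arrives. The name `stream_answer` and its parameters are hypothetical; it takes the client as an argument so it does not depend on the notebook's global state.

```python
def stream_answer(client, question, data_text, model="gpt-4o"):
    """Print the reply as it arrives and return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful and expert data analyst."},
            {"role": "user", "content": f"{question}\n\nData:\n{data_text}"},
        ],
        temperature=0,
        stream=True,  # yield incremental chunks instead of one final message
    )
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or no text (for example the final one)
        if chunk.choices and chunk.choices[0].delta.content is not None:
            text = chunk.choices[0].delta.content
            print(text, end="")
            parts.append(text)
    return "".join(parts)
```

A call such as `stream_answer(client, user_question, df.to_csv())` would then show the answer incrementally, which is useful when responses are long.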


## Step 3: Test the Functions
In this section, we test the functions by uploading a CSV file, loading it into a DataFrame, and asking GPT-4o a question about it.

In [41]:
from google.colab import files
uploaded = files.upload()

for file_name in uploaded.keys():
    df = load_data(file_name)
    break

# Display the DataFrame
df.head()

Saving candy-data.csv to candy-data.csv


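| | competitorname | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | winpercent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 Grand | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0.732 | 0.86 | 66.971725 |
| 1 | 3 Musketeers | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.604 | 0.511 | 67.602936 |
| 2 | One dime | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.116 | 32.261086 |
| 3 | One quarter | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.511 | 46.116505 |
| 4 | Air Heads | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.906 | 0.511 | 52.341465 |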


In [43]:
# prompt: Using dataframe df: least sugary and cheapest

df.sort_values(by=['sugarpercent', 'pricepercent']).head(1)

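| | competitorname | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | winpercent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | One dime | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.116 | 32.261086 |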
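Note that `sort_values` with two keys sorts lexicographically: rows are ordered by `sugarpercent` first, and `pricepercent` only breaks ties. The sketch below illustrates this using just the five rows shown by `df.head()` earlier, and also checks each criterion on its own. Because it covers five rows only, its results say nothing about the full dataset.

```python
import pandas as pd

# The five rows shown by df.head() above, restricted to the relevant columns
candy = pd.DataFrame({
    "competitorname": ["100 Grand", "3 Musketeers", "One dime", "One quarter", "Air Heads"],
    "sugarpercent": [0.732, 0.604, 0.011, 0.011, 0.906],
    "pricepercent": [0.860, 0.511, 0.116, 0.511, 0.511],
    "winpercent": [66.971725, 67.602936, 32.261086, 46.116505, 52.341465],
})

# Lexicographic sort: lowest sugar first, price only breaks ties
ordered = candy.sort_values(by=["sugarpercent", "pricepercent"])
print(ordered["competitorname"].tolist())

# Each criterion on its own can point at a different row
min_sugar = candy["sugarpercent"].min()
least_sugary = candy.loc[candy["sugarpercent"] == min_sugar, "competitorname"].tolist()
cheapest = candy.loc[candy["pricepercent"].idxmin(), "competitorname"]
top_win = candy.loc[candy["winpercent"].idxmax(), "competitorname"]
print(least_sugary, cheapest, top_win)
```

Checks like these give a deterministic baseline against which the model's answer can be compared.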


In [46]:
# Define a user question
user_question = "What is the least sugary, highest win, and cheapest?"

# Use GPT-4o to get an analysis
gpt4_analysis = ask_gpt4(user_question, df)
print("GPT-4 Analysis:", gpt4_analysis)

GPT-4 Analysis: To determine the least sugary, highest win, and cheapest candy, we need to analyze the data based on three criteria: sugar percentage, win percentage, and price percentage. Let's break down the analysis step-by-step:

1. **Least Sugary**: Identify the candy with the lowest sugar percentage.
2. **Highest Win**: Identify the candy with the highest win percentage.
3. **Cheapest**: Identify the candy with the lowest price percentage.

### Step 1: Least Sugary
From the data, the candy with the lowest sugar percentage is:
- **One dime** and **One quarter** both have a sugar percentage of 0.011.

### Step 2: Highest Win
From the data, the candy with the highest win percentage is:
- **Reese’s Peanut Butter cup** with a win percentage of 84.18029.

### Step 3: Cheapest
From the data, the candy with the lowest price percentage is:
- **Tootsie Roll Midgies** with a price percentage of 0.011.

### Combined Criteria
To find a candy that meets all three criteria (least sugary, highes