<a href="https://colab.research.google.com/github/rezazamani2329/AIML-UC-Berkeley-Generative-AI/blob/main/Intro_to_HF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to Hugging Face

This notebook demonstrates how to load a dataset from Hugging Face with `pandas` and how to use the `OpenAI` Python client to generate a narrative about the data it contains.

In [None]:
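# Install the required libraries if they are not already available.
# Reading "hf://" paths with pandas also relies on the huggingface_hub package.
#!pip install -q pandas openai huggingface_hub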

#import the necessary libraries
import pandas as pd
from openai import OpenAI


#import the dataset from Hugging Face
df = pd.read_csv("hf://datasets/KokilaSivakumar/Sales/sales.csv")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The code in the cell below displays the first five rows of the dataframe, which contains product sales data.

In [None]:
df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country/Region,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2021-152156,11/8/2021,11/11/2021,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,2,0.0,41.9136
1,2,CA-2021-152156,11/8/2021,11/11/2021,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3,0.0,219.582
2,3,CA-2021-138688,6/12/2021,6/16/2021,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,2,0.0,6.8714
3,4,US-2020-108966,10/11/2020,10/18/2020,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.031
4,5,US-2020-108966,10/11/2020,10/18/2020,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,2,0.2,2.5164


Suppose we want to analyze the total sales per city.

The code below calculates the total sales per city from the dataframe `df` and stores the result in a new pandas Series called `total_sales`, indexed by city. Here's a breakdown of each part:

- `df.groupby('City')`: Groups the rows in the DataFrame by the values in the `City` column. This creates a grouping where each unique city has its own group.
- `['Sales'].sum()`: After grouping by city, this selects the 'Sales' column within each group and calculates the sum of sales values for each city.


In [None]:
total_sales = df.groupby('City')['Sales'].sum()

total_sales

City,Sales
Aberdeen,25.500
Abilene,1.392
Akron,2729.986
Albuquerque,2220.160
Alexandria,5519.570
...,...
Woonsocket,195.550
Yonkers,7657.666
York,817.978
Yucaipa,50.800
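
The grouping step can be tried on a much smaller table to see exactly what it returns. The following is a minimal sketch; the `toy` dataframe and its values are hypothetical and are not taken from the sales dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the sales dataset.
toy = pd.DataFrame(
    {
        "City": ["Akron", "York", "Akron", "York", "Yonkers"],
        "Sales": [10.0, 5.0, 2.5, 1.0, 7.0],
    }
)

# Group rows by city, then sum the Sales values inside each group.
toy_totals = toy.groupby("City")["Sales"].sum()

# The result is a Series indexed by city, sorted by the group key.
print(type(toy_totals).__name__)
print(toy_totals.to_dict())
```

The result is a Series whose index holds the city names, which is why the output above shows one row per city.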


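The OpenAI chat API takes message content as text, so the numeric results must first be serialized to a string.

The code below converts `total_sales` to a single-line string. Note that `index=False` omits the city names, so the resulting string contains only the sales figures.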

In [None]:
data_string = total_sales.to_string(index=False).replace('\n', ' ')
data_dict = {"data": data_string}
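
To see what `index=False` does to the serialized text, here is a small sketch on hypothetical totals (the `totals` Series stands in for `total_sales`; its values are made up). It also shows one possible alternative, assuming the goal is a prompt about the highest-selling cities: select them with `nlargest` first and keep the index so the labels survive.

```python
import pandas as pd

# Hypothetical per-city totals, standing in for `total_sales`.
totals = pd.Series({"Akron": 12.5, "Yonkers": 7.0, "York": 6.0}, name="Sales")
totals.index.name = "City"

# index=False keeps only the values, so the city labels are lost.
values_only = totals.to_string(index=False).replace("\n", " ")

# Selecting the largest totals first and keeping the index preserves the labels.
top_two = totals.nlargest(2)
labelled = top_two.to_string().replace("\n", " ")

print(values_only)
print(labelled)
```

With the labels kept and the selection done in pandas, the model no longer has to guess which figures belong to which city.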

In [None]:
# Replace the placeholder below with your own API key; avoid sharing a notebook that contains a real key.
client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key="your_key",
)
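
As the comment in the cell above notes, the client can also read the key from the `OPENAI_API_KEY` environment variable. A minimal sketch of that approach follows; `load_api_key` is a hypothetical helper written for this example, not part of the `openai` package.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable, or raise if unset."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage, assuming the openai package is installed and the variable is set:
# from openai import OpenAI
# client = OpenAI(api_key=load_api_key())
```

Failing early with a clear message is a design choice here; it avoids a less obvious authentication error at request time.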

In [None]:
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content":  f"Analyze the average sales for the top 10 cities: {data_dict['data']}"}
    ],
    model="gpt-3.5-turbo",
)

narrative = chat_completion.choices[0].message.content

narrative.replace('\n', ' ')

"To analyze the average sales for the top 10 cities, we need to first extract the sales data for each city from the given list. Then, we can calculate the average sales for each city and finally determine the average sales for the top 10 cities.  Here are the sales data for the top 10 cities: 1. City 1: 25.5000 2. City 2: 2729.9860 3. City 3: 5519.5700 4. City 4: 3773.0628 5. City 5: 20214.5320 6. City 6: 17197.8400 7. City 7: 11656.4780 8. City 8: 9063.4960 9. City 9: 7452.9960 10. City 10: 64504.7604  Now, let's calculate the average sales for the top 10 cities:  Total sales = 25.5000 + 2729.9860 + 5519.5700 + 3773.0628 + 20214.5320 + 17197.8400 + 11656.4780 + 9063.4960 + 7452.9960 + 64504.7604 = 144538.2212  Average sales = Total sales / Number of cities Average sales = 144538.2212 / 10 = 14453.82212  Therefore, the average sales for the top 10 cities is $14,453.82."

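Model-generated arithmetic is worth checking independently. The sketch below re-adds the ten figures quoted in the response above (the list is copied from that response). The recomputed total differs from the 144538.2212 the model reports. Note also that the first three figures match Aberdeen, Akron, and Alexandria in the `total_sales` output, so the model appears to have picked values from the start of the alphabetical list rather than the ten highest totals.

```python
import math

# The ten figures quoted in the model's response above.
quoted = [
    25.5000, 2729.9860, 5519.5700, 3773.0628, 20214.5320,
    17197.8400, 11656.4780, 9063.4960, 7452.9960, 64504.7604,
]

# fsum limits floating-point rounding error when adding many values.
total = math.fsum(quoted)
average = total / len(quoted)

print(round(total, 4))    # 142138.2212
print(round(average, 2))  # 14213.82
```
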