# 04.5 - LLMs in data analytics and data visualization: Smart Dataframes

Adapted from: https://colab.research.google.com/drive/18JKRdPCnsUIm9d4sThvsySSPAfbc4ybk

## Using Large Language Models (LLMs) in data visualization

LLMs can help with data visualization tasks in several ways:  
* They can read tables and help us understand the data by writing summaries  
* They can translate our visualization ideas from English into code, accelerating the development process  

Drawbacks:
* Their output is not yet reliable: generated code, numbers, and summaries can be wrong and should be checked. Their capabilities are improving, though.
  
Contents  
* How to instantiate a smart dataframe: pandas + agent
* How to get simple information from the CSV table
* How to create simple plots from English instructions

Keep in mind OpenAI's recommended best practices to get the most out of your queries:  
https://platform.openai.com/docs/guides/gpt-best-practices/six-strategies-for-getting-better-results

Official example:  
https://colab.research.google.com/drive/1ZnO-njhL7TBOYPZaqvMvGtsjckZKrv2E?usp=sharing

In [None]:
# Required libs
!pip install pandasai
!pip install seaborn

In [None]:
# Importing libraries
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandasai import SmartDataframe
from langchain_openai import ChatOpenAI

In [None]:
# Importing the data
df = pd.read_csv('../Files/SAheart.data')
df.head()

### Basic EDA
The data we are analysing comes from https://rdrr.io/cran/ElemStatLearn/man/SAheart.html and concerns coronary heart disease (chd).   


Column | Description
-------|------------
sbp | systolic blood pressure
tobacco | cumulative tobacco (kg)
ldl | low density lipoprotein cholesterol
adiposity | a numeric vector
famhist | family history of heart disease, a factor with levels Absent Present
typea | type-A behavior
obesity | a numeric vector
alcohol | current alcohol consumption
age | age at onset
chd | response, coronary heart disease

### Creating a Smart Dataframe

We need an LLM and a pandas dataframe (which we already have).
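
`ChatOpenAI` needs an OpenAI API key, which by default it reads from the `OPENAI_API_KEY` environment variable. A minimal sketch of a pre-flight check that fails early with a readable message; `require_api_key` is a hypothetical helper name used here for illustration and is not part of langchain or pandasai:

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    Hypothetical helper for this notebook, not a library function.
    `env` is any mapping; it defaults to the process environment.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the LLM.")
    return key
```

Calling `require_api_key()` before constructing the LLM surfaces a missing key before any query is sent.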

In [None]:
llm = ChatOpenAI(temperature=0.5)  # allow for some randomness in the answers

In [None]:
llm.model_name

In [None]:
sdf = SmartDataframe(df, config={'llm':llm})

In [None]:
# Let's check the shape of the data.
sdf.chat("What is the shape of the dataset?")

In [None]:
# Identifying missing values
sdf.chat("How many missing values are there in each column?")
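
Since LLM answers can be wrong, it helps to cross-check simple facts such as the shape and the missing-value counts with plain pandas. The sketch below uses a tiny made-up frame called `toy` (an assumption for illustration, not the SAheart data); in the notebook the same calls would be applied to `df`:

```python
import pandas as pd

# Tiny illustrative frame with made-up values (NOT the SAheart data).
toy = pd.DataFrame(
    {
        "age": [52, 63, None, 31],
        "famhist": ["Present", "Absent", "Present", None],
        "chd": [1, 1, 0, 0],
    }
)

# Same information the "shape" question asks for.
n_rows, n_cols = toy.shape

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

print(n_rows, n_cols)
print(missing_per_column)
```

If the numbers returned by `sdf.chat` disagree with these, trust the direct computation.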

In [None]:
# Let us see what the data looks like
sdf.chat("Display 5 records in form of a table.")

In [None]:
# Let us try to write a simple summary of the data:
sdf.chat("A good data summary must have: the number of elements, a qualitative description of the population in the data by describing the basic statistics for age and family history (famhist).  \
     You first must find the relevant columns in the dataset, age and famhist, then compute the basic statistics and then write the summary.\
     The summary should look like: We have a simple dataset with 222 elements. The population in the set has ages averaging 56, but there are a few elements as young as 10 years and as old as 100. Half of the population has family history, and those that have are much younger, averaging 40 years.")
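
The figures that the requested summary relies on (element count, age statistics, share with family history, mean age per `famhist` level) can also be computed directly, which gives a reference to compare the LLM's text against. A minimal sketch on a made-up frame `toy` (illustrative values only, not the SAheart data):

```python
import pandas as pd

# Made-up values for illustration; replace `toy` with `df` in the notebook.
toy = pd.DataFrame(
    {
        "age": [20, 40, 60, 30, 50],
        "famhist": ["Present", "Absent", "Absent", "Present", "Absent"],
    }
)

n = len(toy)                                          # number of elements
age_stats = toy["age"].agg(["mean", "min", "max"])    # basic age statistics
famhist_share = (toy["famhist"] == "Present").mean()  # fraction with family history
mean_age_by_famhist = toy.groupby("famhist")["age"].mean()

print(n, famhist_share)
print(age_stats)
print(mean_age_by_famhist)
```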

In [None]:
sdf.chat("Show the distribution of people suffering from chd using a bar graph.")

In [None]:
sdf.chat("""Show the distribution of age where the person is
suffering from chd using a histogram with bins of
0 to 10, 10 to 20, 20 to 30 years and so on. Do not show grid. Add title.""")
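
The ten-year bins described in the prompt can be reproduced with `pd.cut`, which yields the counts a correct histogram should display. This is a sketch on made-up data (`toy` is an assumption, not the SAheart data), and it treats each bin as closed on the left, e.g. 50 to 60 means `[50, 60)`:

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame(
    {
        "age": [15, 25, 38, 42, 47, 55, 59, 61],
        "chd": [0, 0, 1, 0, 1, 1, 1, 1],
    }
)

bins = list(range(0, 101, 10))  # bin edges 0, 10, ..., 100

# Ages of the rows where chd == 1.
chd_ages = toy.loc[toy["chd"] == 1, "age"]

# Count how many of those ages fall into each ten-year bin.
counts = pd.cut(chd_ages, bins=bins, right=False).value_counts(sort=False)

print(counts)
```

Passing the same `bins` list to `matplotlib.pyplot.hist` draws the corresponding histogram.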

In [None]:
sdf.chat("""Draw a boxplot to find out if there are any outliers
in the age of people who are suffering from chd.""")
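
Box plots conventionally flag points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule by hand so the plot can be checked against concrete numbers; the ages are made up for illustration and are not taken from the SAheart data:

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([38, 42, 45, 47, 50, 52, 55, 58, 60, 15], name="age")

# First and third quartiles, then the interquartile range.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional whisker limits: 1.5 * IQR beyond each quartile.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the limits are the candidate outliers.
outliers = ages[(ages < lower) | (ages > upper)]

print(lower, upper)
print(outliers)
```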

In [None]:
# Does tobacco cause CHD? (Note: a t-test compares group means; on its own it cannot establish causation.)
sdf.chat("""Validate the following hypothesis with t-test.
Null Hypothesis: Consumption of Tobacco does not cause chd.
Alternate Hypothesis: Consumption of Tobacco causes chd.""")
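
A two-sample t-test only asks whether mean tobacco consumption differs between the chd and non-chd groups, so a significant result indicates association, not causation. As a reference for the LLM's answer, here is a minimal sketch of Welch's unequal-variance t statistic using only the standard library. The function name `welch_t` and the sample values are assumptions for illustration; the choice of the Welch variant is also an assumption, since the prompt does not say which t-test is meant:

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Sample variances (n - 1 in the denominator).
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Squared standard errors of each group mean.
    se_a, se_b = var_a / len(a), var_b / len(b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


# Made-up tobacco values (kg) for illustration only, NOT from SAheart.
tobacco_chd = [6.0, 8.0, 10.0, 12.0]
tobacco_no_chd = [1.0, 2.0, 3.0, 4.0]

t_stat, dof = welch_t(tobacco_chd, tobacco_no_chd)
print(t_stat, dof)
```

With SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same test and also returns a p-value.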

In [None]:
# How is CHD distributed across age groups?
sdf.chat("""Plot the distribution of age for both chd positive and negative using a KDE plot. Also provide a legend and label the x and y axes.""")

In [None]:
sdf.chat("""Plot the distribution of age for both chd positive and negative using a KDE plot. Also provide a legend and label the x and y axes. Use shaded areas.""")

In [None]:
sdf.chat("""Plot the distribution of age for both chd positive and negative using a KDE plot. Also provide a legend and label the x and y axes. Use shaded areas and absolute values.""")

In [None]:
# For comparison: the same KDE plot written directly with seaborn
sns.kdeplot(data=df, x='age', hue='chd')

Comments?