In [4]:
import os
import pandas as pd
from smartdata import SmartData
from dotenv import load_dotenv

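# Load OPENAI_API_KEY from a local .env file, if one exists
load_dotenv()

# Alternatively, set the OpenAI API key here:
# os.environ["OPENAI_API_KEY"] = "Your openai key"

# Fail early with a clear message if the key is still missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")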

# Read sample data
df = pd.read_csv(r"https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv", index_col=0)

# Create SmartData Model
sd = SmartData(df, memory_size=0, show_detail=False)
prompt, sd_model = sd.create_model()

# Clean the data. clean_data() returns three values:
# - summary: a text summary of the cleaning results, including the actions taken and the records affected.
# - has_changes_to_df: a boolean indicating whether cleaning changed anything in the existing df.
# - df_new: the cleaned DataFrame produced by the cleaning process.
summary, has_changes_to_df, df_new = sd.clean_data()
print(summary)
print(f"has_changes_to_df: {has_changes_to_df}")
print(df_new.head(5))
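
# Optional sanity check (a hedged sketch, not part of SmartData): the helper
# below is hypothetical and relies only on pandas. It compares per-column
# missing-value counts before and after cleaning, so the counts reported in
# the summary can be checked against the actual DataFrames.
def missing_value_report(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts per column before and after cleaning."""
    report = pd.DataFrame({
        "missing_before": before.isna().sum(),
        "missing_after": after.isna().sum(),
    })
    # Keep only columns that had gaps before cleaning or still have gaps after
    return report[(report["missing_before"] > 0) | (report["missing_after"] > 0)]

# Example call (left commented out so the cell output below is unchanged):
# print(missing_value_report(df, df_new))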

We’ve made some great strides in cleaning up our data! Here’s a friendly overview of what we accomplished:

- **Filled in missing values** for numeric columns using the average, like Age (177 values).
- **Capped outliers** for several columns to keep our data in check:
  - Age: 7 to 8
  - SibSp: 0 to 7
  - Parch: 0 to 6
  - Fare: 0 to 9
- **Categorical columns** with missing values were filled with 'Not Specified':
  - Cabin (687 values) and Embarked (2 values).
  
We didn’t remove any rows or columns, which is fantastic! We also standardized categories for **Sex**, **Embarked**, and **Cabin**, and replaced unreasonable values in **Age**, **Fare**, **SibSp**, and **Parch** with their averages. Overall, our dataset is now cleaner and ready for analysis!
has_changes_to_df: True
             Survived  Pclass  \
PassengerId                     
1                   0       3   
2                   1       1   
3                   1       3   
4                   1       1   
5              