
In [None]:
from google.colab import userdata
from openai import OpenAI
import pandas as pd


# Load key from Google Colab Secrets
api_key = userdata.get('OPENAI_API_KEY')


client = OpenAI(
    # Pass the key explicitly: it comes from Colab Secrets,
    # not from the OPENAI_API_KEY environment variable
    api_key=api_key,
)

# Load dataset
data = 'https://raw.githubusercontent.com/tpkeeyeerpa2024/ADALL_github/refs/heads/main/laptop_prices_2024_sgd_TL.csv'
df = pd.read_csv(data)
print("Data loaded")

# Convert the first few rows to a string to send to OpenAI
data_preview = df.head(10).to_csv(index=False)
print(data_preview)

response = client.responses.create(
    model="gpt-4o-mini",
    instructions="You are an expert data scientist with extensive knowledge of predictive analysis and linear regression.",
    input=f"Dataset: Laptop Prices\nHere are the first 10 rows of the dataset:\n{data_preview}",
)
print(response.output_text)




Data loaded
Brand,Model,CPU,GPU,RAM_GB,Storage_Type,Storage_GB,Touchscreen,Weight_kg,Screen_Size_inch,Discount_percent,Price_SGD,Brand_Discount,Member_Discount
Acer,Aspire 5,Intel i9-14900HK,NVIDIA RTX 4070,64,SSD,256,False,1.56,16.0,3.27,2413.36,5,144.8
Acer,Nitro 5,AMD Ryzen 9 8900HX,AMD Radeon 780M,32,SSD,1024,True,1.45,14.0,5.03,1773.75,5,124.16
Acer,Nitro 5,AMD Ryzen 5 8600H,NVIDIA RTX 4050,32,SSD,2048,False,1.34,14.0,4.41,1634.07,5,98.04
Acer,TravelMate P6,Intel Core Ultra 7 15500H,NVIDIA RTX 4060,16,SSD,4096,True,1.18,13.3,2.16,2362.59,5,118.13
Acer,Predator Helios 300,Intel i7-14800H,NVIDIA RTX 4070,8,SSD,1024,True,1.31,14.0,6.93,2218.55,5,155.3
Acer,Aspire 5,Intel i9-14900HK,NVIDIA RTX 4080,32,SSD,256,False,3.34,17.3,8.94,2224.12,5,155.69
Acer,TravelMate P6,Intel i9-14900HK,NVIDIA RTX 4080,128,SSD,1024,True,1.11,13.3,11.72,3016.36,5,211.15
Acer,Predator Helios 300,Intel Core Ultra 9 15700H,NVIDIA RTX 4070,16,SSD,2048,True,1.29,13.3,8.84,2579.29,5,180.55
Acer,Swift 3,Intel i5-1

In [None]:
# Describe the dataset size, column types, missing values and highlight any unusual patterns

# What does one row represent?

# What is the target variable?

# Is the data clean? Does it contain missing values? How about duplicate values?



# Task
Analyze the structure and quality of the `df` DataFrame by describing its shape, confirming 'Price_SGD' as the target variable, checking data types and non-null counts, identifying duplicate rows, generating descriptive statistics for numerical columns, and analyzing categorical column distributions, then summarize all findings.

## Describe Dataset Shape and Row Representation

### Subtask:
Determine the number of rows and columns in the DataFrame (`df.shape`) and clearly state that each row represents one laptop listing (a specific configuration of a model) with its specifications and pricing information.


**Reasoning**:
First, I will print the shape of the DataFrame to show the number of rows and columns, as instructed.



In [None]:
print(f"DataFrame shape: {df.shape}")

DataFrame shape: (1000, 14)


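Each row in the DataFrame `df` represents one laptop listing: a specific configuration of a model. Rows are not unique models, because the same model appears more than once with different specifications and prices (in the preview, 'Aspire 5' appears with both an RTX 4070 and an RTX 4080). Each row includes specifications such as Brand, Model, CPU, GPU, RAM, Storage Type and Size, Touchscreen availability, Weight, and Screen Size, plus pricing details: Discount Percent, Price in SGD, Brand Discount, and Member Discount.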

## Identify Target Variable

### Subtask:
Confirm that 'Price_SGD' is the target variable for predictive analysis, as identified by the previous analysis.


The target variable for this predictive analysis is `Price_SGD`. This variable represents the price of laptops in Singapore Dollars, which we aim to predict based on other features in the dataset.
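With the target fixed, the features and the target can be separated before any modeling. The following is a minimal sketch rather than the notebook's own code: `sample` is a three-row stand-in built from the preview rows printed earlier, and `split_features_target` is a hypothetical helper name.

```python
import pandas as pd

# Stand-in for `df`: three rows copied from the preview (subset of columns).
sample = pd.DataFrame({
    "Brand": ["Acer", "Acer", "Acer"],
    "RAM_GB": [64, 32, 32],
    "Storage_GB": [256, 1024, 2048],
    "Price_SGD": [2413.36, 1773.75, 1634.07],
})

TARGET = "Price_SGD"


def split_features_target(frame: pd.DataFrame, target: str = TARGET):
    """Return (X, y): every column except the target, and the target column."""
    X = frame.drop(columns=[target])
    y = frame[target]
    return X, y


X, y = split_features_target(sample)
print(X.columns.tolist())
print(y.tolist())
```

On the full dataset the same call would be `split_features_target(df)`.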

## Check Data Types and Non-Null Counts

### Subtask:
Display the DataFrame's information to show column names, their data types, and the count of non-null values.


**Reasoning**:
To display the DataFrame's information, including column names, their data types, and the count of non-null values, I will use the `.info()` method.



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Brand             1000 non-null   object 
 1   Model             1000 non-null   object 
 2   CPU               1000 non-null   object 
 3   GPU               1000 non-null   object 
 4   RAM_GB            1000 non-null   int64  
 5   Storage_Type      1000 non-null   object 
 6   Storage_GB        1000 non-null   int64  
 7   Touchscreen       1000 non-null   bool   
 8   Weight_kg         1000 non-null   float64
 9   Screen_Size_inch  1000 non-null   float64
 10  Discount_percent  1000 non-null   float64
 11  Price_SGD         1000 non-null   float64
 12  Brand_Discount    1000 non-null   int64  
 13  Member_Discount   1000 non-null   float64
dtypes: bool(1), float64(5), int64(3), object(5)
memory usage: 102.7+ KB
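The non-null counts above imply that nothing is missing, but an explicit per-column count is easier to read and to assert on. The sketch below is one way to do that; `missing_report` is a hypothetical helper, and `sample` is a small stand-in for `df` with one value deliberately blanked out so that a non-zero count is visible.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    pct = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"missing": counts, "pct_missing": pct})


# Stand-in for `df`: three preview rows (subset of columns).
sample = pd.DataFrame({
    "Model": ["Aspire 5", "Nitro 5", "Nitro 5"],
    "RAM_GB": [64, 32, 32],
    "Weight_kg": [1.56, 1.45, None],  # None is an illustrative gap, not real data
})

report = missing_report(sample)
print(report)
```

Run against the real `df`, every `missing` entry should be 0, matching the `.info()` output.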


## Check for Duplicate Rows

### Subtask:
Identify and count any fully duplicated rows in the dataset using `df.duplicated().sum()` to assess data uniqueness.


**Reasoning**:
To identify and count fully duplicated rows, I will use the `.duplicated()` method combined with `.sum()` on the DataFrame `df`.



In [None]:
duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

Number of duplicate rows: 0
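Zero fully duplicated rows does not mean each model appears only once. A duplicate check on a subset of columns makes the difference visible. This is a sketch under the assumption that `Brand` plus `Model` identifies a model; `sample` holds the eight complete preview rows, reduced to four columns.

```python
import pandas as pd

# Stand-in for `df`: the eight complete preview rows, four columns only.
sample = pd.DataFrame({
    "Brand": ["Acer"] * 8,
    "Model": [
        "Aspire 5", "Nitro 5", "Nitro 5", "TravelMate P6",
        "Predator Helios 300", "Aspire 5", "TravelMate P6",
        "Predator Helios 300",
    ],
    "GPU": [
        "NVIDIA RTX 4070", "AMD Radeon 780M", "NVIDIA RTX 4050",
        "NVIDIA RTX 4060", "NVIDIA RTX 4070", "NVIDIA RTX 4080",
        "NVIDIA RTX 4080", "NVIDIA RTX 4070",
    ],
    "Price_SGD": [
        2413.36, 1773.75, 1634.07, 2362.59,
        2218.55, 2224.12, 3016.36, 2579.29,
    ],
})

# Rows identical in every column.
full_dupes = int(sample.duplicated().sum())

# Rows whose (Brand, Model) pair already appeared earlier.
model_repeats = int(sample.duplicated(subset=["Brand", "Model"]).sum())

# How many rows each model contributes.
rows_per_model = sample.groupby(["Brand", "Model"]).size()

print(f"Fully duplicated rows: {full_dupes}")
print(f"Rows repeating an earlier Brand/Model pair: {model_repeats}")
print(rows_per_model)
```

The same two calls on `df` would show how many listings share a model while still being distinct rows.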


**Reasoning**:
The subtask of checking for duplicate rows is complete. Now, I will generate descriptive statistics for the numerical columns in the DataFrame `df` to understand their distribution and central tendencies.



In [None]:
print(df.describe())

            RAM_GB   Storage_GB    Weight_kg  Screen_Size_inch  \
count  1000.000000  1000.000000  1000.000000       1000.000000   
mean     53.128000  1505.024000     2.036560         15.645700   
std      44.413288  1380.203919     0.746477          1.759846   
min       8.000000   256.000000     1.000000         13.300000   
25%      16.000000   512.000000     1.340000         14.000000   
50%      32.000000  1024.000000     1.970000         15.600000   
75%      64.000000  2048.000000     2.680000         17.300000   
max     128.000000  4096.000000     3.500000         18.400000   

       Discount_percent    Price_SGD  Brand_Discount  Member_Discount  
count       1000.000000  1000.000000       1000.0000      1000.000000  
mean           7.538230  2277.796410          6.7680       135.810660  
std            4.334012   616.496246          2.6346        41.344593  
min            0.000000   995.770000          3.0000        50.740000  
25%            3.815000  1776.212500         

**Reasoning**:
Now that descriptive statistics for numerical columns have been generated, I will analyze the distribution of categorical columns. This will involve iterating through each categorical column and using `value_counts()` to display the frequency of each unique category.



In [None]:
categorical_cols = df.select_dtypes(include='object').columns

print("\nDistribution of Categorical Columns:")
for col in categorical_cols:
    print(f"\n--- {col} ---")
    print(df[col].value_counts())



Distribution of Categorical Columns:

--- Brand ---
Brand
Asus         177
Acer         176
Microsoft    174
Lenovo       174
Dell         163
HP           136
Name: count, dtype: int64

--- Model ---
Model
Predator Helios 300      48
ExpertBook B9            43
Surface Book 3           43
Legion 5                 40
Alienware M15            38
Aspire 5                 38
Surface Go 3             38
ThinkBook 14s            37
XPS 15                   37
Surface Pro 8            35
TUF Gaming F15           35
Swift 3                  35
Yoga Slim 7              35
ROG Zephyrus G15         34
Pavilion 15              34
ZenBook 14               33
Surface Laptop Studio    33
ThinkPad X1 Carbon       32
VivoBook S15             32
Latitude 7420            32
Nitro 5                  31
Omen 15                  31
IdeaPad 5                30
XPS 13                   28
Inspiron 15              28
Envy 13                  27
Surface Laptop 5         25
TravelMate P6            24
Spectre 

### Summary of Dataset Structure and Quality Analysis

This analysis has examined the structure and quality of the `df` DataFrame, which contains information on various laptop models.

**1. Dataset Shape and Row Representation:**
*   The DataFrame `df` has **1000 rows** and **14 columns**, indicating a moderate-sized dataset.
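*   Each row represents one laptop listing, that is, a specific configuration of a model, detailing its specifications (e.g., Brand, CPU, RAM) and pricing information (e.g., Price_SGD, Discount_percent). Models repeat across rows: there are 30 unique models in 1000 rows.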

**2. Target Variable:**
*   The target variable for predictive analysis has been confirmed as `Price_SGD`, representing laptop prices in Singapore Dollars.

**3. Data Types and Non-Null Counts:**
*   The `df.info()` output shows that all 14 columns have **1000 non-null entries**, indicating **no missing values** across the dataset. This is excellent for data quality.
*   Data types are appropriate for their respective columns:
    *   `object` for categorical text fields (Brand, Model, CPU, GPU, Storage_Type).
    *   `int64` for integer numerical values (RAM_GB, Storage_GB, Brand_Discount).
    *   `float64` for continuous numerical values (Weight_kg, Screen_Size_inch, Discount_percent, Price_SGD, Member_Discount).
    *   `bool` for boolean values (Touchscreen).

**4. Duplicate Rows:**
*   There are **0 duplicate rows** in the dataset, confirming the uniqueness of each laptop record.

**5. Descriptive Statistics for Numerical Columns:**
*   **RAM_GB**: Ranges from 8GB to 128GB, with a mean of ~53GB, showing good variability in RAM options.
*   **Storage_GB**: Ranges from 256GB to 4096GB, with a mean of ~1505GB, also indicating diverse storage capacities.
*   **Weight_kg**: Ranges from 1.0kg to 3.5kg, with a mean of ~2.04kg.
*   **Screen_Size_inch**: Ranges from 13.3 inches to 18.4 inches, with a mean of ~15.65 inches.
*   **Discount_percent**: Ranges from 0% to 15%, with a mean of ~7.54%.
*   **Price_SGD**: Ranges from ~995 SGD to ~4156 SGD, with a mean of ~2278 SGD, showing a broad price range.
*   **Brand_Discount**: Ranges from 3% to 10%, with a mean of ~6.77%.
*   **Member_Discount**: Ranges from ~50 SGD to ~281 SGD, with a mean of ~135.81 SGD.

**6. Distribution of Categorical Columns:**
*   **Brand**: Features 6 brands, with Asus and Acer being the most frequent, and HP being the least frequent.
*   **Model**: Contains 30 unique models, with 'Predator Helios 300' (Acer) being the most common, followed by 'ExpertBook B9' (Asus) and 'Surface Book 3' (Microsoft).
*   **CPU**: Shows 10 different CPU types, predominantly Intel processors, with 'Intel i5-14600H' being the most frequent.
*   **GPU**: Lists 9 distinct GPU types, with 'NVIDIA RTX 4070' being the most common, indicating a strong presence of dedicated graphics cards.
*   **Storage_Type**: All laptops use 'SSD', which simplifies this feature as it is not a differentiating factor within this dataset.
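The `Storage_Type` observation can be checked generically by counting distinct values per column. The sketch below assumes a hypothetical helper name, `constant_columns`, and uses three preview rows as a stand-in for `df`.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    n_unique = frame.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


# Stand-in for `df`: three preview rows (subset of columns).
sample = pd.DataFrame({
    "Model": ["Aspire 5", "Nitro 5", "TravelMate P6"],
    "Storage_Type": ["SSD", "SSD", "SSD"],
    "RAM_GB": [64, 32, 16],
})

constant = constant_columns(sample)
print(constant)
```

`frame.nunique()` on the full `df` also gives the cardinality of `Brand`, `Model`, `CPU`, and `GPU` in one call.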

**Overall Conclusion:**
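The dataset is clean and well-structured: it has no missing values and no fully duplicated rows, and every column has an appropriate data type. Numerical and categorical features vary widely, which helps when building a predictive model for `Price_SGD`; the exception is `Storage_Type`, which is constant. One pattern should be verified before modeling: in the previewed rows, `Member_Discount` is a whole percentage (5% to 7%) of `Price_SGD`, so it may be derived from the target and would leak it if used as a feature. With that caveat, the next steps are feature engineering and model building.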
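Before modeling, it is worth checking whether any pricing column is a function of the target. In the preview rows, `Member_Discount` divided by `Price_SGD` lands on a whole percentage each time. The sketch below reproduces that check on the eight complete preview rows only; whether it holds for all 1000 rows is an assumption to confirm on `df`.

```python
import pandas as pd

# Stand-in for `df`: Price_SGD and Member_Discount from the eight preview rows.
sample = pd.DataFrame({
    "Price_SGD": [
        2413.36, 1773.75, 1634.07, 2362.59,
        2218.55, 2224.12, 3016.36, 2579.29,
    ],
    "Member_Discount": [
        144.8, 124.16, 98.04, 118.13,
        155.3, 155.69, 211.15, 180.55,
    ],
})

# Member discount expressed as a percentage of the price.
ratio_pct = sample["Member_Discount"] / sample["Price_SGD"] * 100

# Distance of each ratio from the nearest whole percent.
deviation = (ratio_pct - ratio_pct.round()).abs()

print(ratio_pct.round(2).tolist())
print("All ratios within 0.01 of a whole percent:", bool((deviation < 0.01).all()))
```

If the pattern holds on the full data, `Member_Discount` should be excluded from the features, since it could only be computed once the price is known.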

## Final Task

### Subtask:
Summarize all findings regarding the dataset's size, column types, missing values, duplicates, the target variable, and any observed unusual patterns or characteristics.


## Summary:

### Data Analysis Key Findings

*   The dataset `df` has 1000 rows and 14 columns. Each row represents one laptop listing (a specific configuration of a model) with its specifications and pricing information; models repeat across rows.
*   The target variable for predictive analysis has been confirmed as `Price_SGD`, representing laptop prices in Singapore Dollars.
*   The dataset exhibits excellent data quality, with all 14 columns having 1000 non-null entries, indicating a complete absence of missing values.
*   Data types are appropriately assigned: 5 `object` columns for categorical text, 3 `int64` columns for integer numerical values, 5 `float64` columns for continuous numerical values, and 1 `bool` column.
*   There are 0 duplicate rows in the dataset, confirming the uniqueness of each laptop record.
*   Numerical columns show good variability:
    *   `RAM_GB` ranges from 8GB to 128GB (mean ~53GB).
    *   `Storage_GB` ranges from 256GB to 4096GB (mean ~1505GB).
    *   `Weight_kg` ranges from 1.0kg to 3.5kg (mean ~2.04kg).
    *   `Screen_Size_inch` ranges from 13.3 to 18.4 inches (mean ~15.65 inches).
    *   `Price_SGD` ranges from ~995 SGD to ~4156 SGD (mean ~2278 SGD), indicating a broad price spectrum.
*   Categorical columns also exhibit diversity:
    *   There are 6 distinct `Brand`s, with Asus and Acer being the most frequent.
    *   30 unique `Model`s are present, with 'Predator Helios 300' (Acer) being the most common.
    *   10 different `CPU` types are identified, predominantly Intel processors.
    *   9 distinct `GPU` types, with 'NVIDIA RTX 4070' being the most common.
    *   A notable observation is that `Storage_Type` is uniformly 'SSD' across all entries, making it a non-differentiating feature within this dataset.

### Insights or Next Steps

*   Given the dataset's high quality (no missing values or duplicates, appropriate data types), the next logical step is feature engineering, such as grouping CPU and GPU names into tiers. Features computed from `Price_SGD` itself (for example, price per GB of RAM or storage) should not be used as model inputs, because they leak the target.
*   `Storage_Type` is 'SSD' in every row, so it carries no information for prediction and can be dropped as a constant feature.
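The CPU tiering and constant-column ideas above can be sketched as follows. The tier rule (the digit after `i`, `Ryzen`, or `Ultra`) is an illustrative assumption based on the CPU names in the preview, not a rule from the dataset's author, and `add_cpu_features` is a hypothetical helper name.

```python
import pandas as pd

# Stand-in for `df`: the first four preview rows (subset of columns).
sample = pd.DataFrame({
    "CPU": [
        "Intel i9-14900HK",
        "AMD Ryzen 9 8900HX",
        "AMD Ryzen 5 8600H",
        "Intel Core Ultra 7 15500H",
    ],
    "Storage_Type": ["SSD"] * 4,
    "RAM_GB": [64, 32, 32, 16],
})


def add_cpu_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Add CPU vendor and tier columns and drop the constant Storage_Type."""
    out = frame.copy()
    # Vendor is the first word of the CPU name.
    out["CPU_vendor"] = out["CPU"].str.split().str[0]
    # Tier is the digit after "i", "Ryzen " or "Ultra " (assumed naming rule).
    tier = out["CPU"].str.extract(r"(?:i|Ryzen |Ultra )(\d)", expand=False)
    out["CPU_tier"] = pd.to_numeric(tier)
    # Storage_Type is 'SSD' everywhere, so it adds nothing to a model.
    return out.drop(columns=["Storage_Type"])


features = add_cpu_features(sample)
print(features)
```

CPU names that do not match the assumed pattern would get a missing tier, so the result should be checked with `features["CPU_tier"].isna().sum()` when applied to the full `df`.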


In [None]:
# Perform data exploration: report the dataset size, column types, and missing values, and highlight anything unusual. Determine what one row of the dataset represents, what the target is, whether the data is clean, and whether there are any constraints or assumptions.


# Task
Perform a comprehensive exploratory data analysis (EDA) on the `df` DataFrame to understand its structure, quality, and characteristics. This includes describing the dataset's shape and row representation, identifying the target variable ('Price_SGD'), checking data types and non-null counts, identifying duplicate rows, generating descriptive statistics for numerical columns, and analyzing the distribution of categorical columns. Finally, summarize all findings, including any observed unusual patterns or characteristics, and state any identified constraints or assumptions.

## Describe Dataset Shape and Row Representation

### Subtask:
Determine the number of rows and columns in the DataFrame (`df.shape`) and clearly state that each row represents one laptop listing (a specific configuration of a model) with its specifications and pricing information.


## Identify Target Variable

### Subtask:
Confirm that 'Price_SGD' is the target variable for predictive analysis, as identified by the previous analysis.


The target variable for this predictive analysis is `Price_SGD`. This variable represents the price of laptops in Singapore Dollars, which we aim to predict based on other features in the dataset.

## Final Task

### Subtask:
Summarize all findings regarding the dataset's size, column types, missing values, duplicates, the target variable, data cleanliness, and any observed unusual patterns or characteristics, including identified constraints or assumptions.


## Summary:

### Data Analysis Key Findings
*   The dataset, `df`, contains 1000 rows and 14 columns.
*   Each row in the DataFrame represents one laptop listing (a specific configuration of a model), including its specifications and pricing information.
*   The target variable for predictive analysis is confirmed to be `Price_SGD`, which represents laptop prices in Singapore Dollars.
