# 1. introduction

## 1.1 about dataset

<p>
    This dataset simulates sales transactions for mobile phones and laptops, including product specifications, customer details, and sales information. It contains 50,000 rows of randomly generated data to help analyze product sales trends, customer purchasing behavior, and regional distribution of sales.
</p>
<p>You can download it from <a href="https://www.kaggle.com/datasets/vinothkannaece/mobiles-and-laptop-sales-data">Kaggle</a>.</p>

## 1.2 what are we going to do?

<p>We are going to explore the dataset and then train some machine learning models on it.</p>

# 2. understand dataset

## 2.1 import needed libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 2.2 read dataset

In [2]:
df = pd.read_csv("./mobile_sales_data.csv")

## 2.3 get shape and columns name

In [3]:
df.shape

(50000, 16)

In [4]:
df.columns

Index(['Product', 'Brand', 'Product Code', 'Product Specification', 'Price',
       'Inward Date', 'Dispatch Date', 'Quantity Sold', 'Customer Name',
       'Customer Location', 'Region', 'Core Specification',
       'Processor Specification', 'RAM', 'ROM', 'SSD'],
      dtype='object')

## 2.4 rename columns

In [5]:
df = df.rename(columns={
    'Product': 'product_name',
    'Brand': 'brand_name',
    'Product Code': 'product_id',
    'Product Specification': 'product_specs',
    'Price': 'price',
    'Inward Date': 'received_date',
    'Dispatch Date': 'shipped_date',
    'Quantity Sold': 'units_sold',
    'Customer Name': 'customer_name',
    'Customer Location': 'customer_city',
    'Region': 'sales_region',
    'Core Specification': 'core_specs',
    'Processor Specification': 'processor_details',
    'RAM': 'ram_size',
    'ROM': 'rom_size',
    'SSD': 'ssd_size'
})
df.head(n=3)

| | product_name | brand_name | product_id | product_specs | price | received_date | shipped_date | units_sold | customer_name | customer_city | sales_region | core_specs | processor_details | ram_size | rom_size | ssd_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mobile Phone | Motorola | 88EB4558 | Site candidate activity company there bit insi... | 78570 | 2023-08-02 | 2023-08-03 | 6 | William Hess | South Kelsey | Central | NaN | Snapdragon 7 Gen | 12GB | 128GB | NaN |
| 1 | Laptop | Oppo | 416DFEEB | Beat put care fight affect address his. | 44613 | 2023-10-03 | 2023-10-06 | 1 | Larry Smith | North Lisa | South | Ryzen 5 | Ryzen 5 | 8GB | 512GB | 256GB |
| 2 | Mobile Phone | Samsung | 9F975B08 | Energy special low seven place audience. | 159826 | 2025-03-19 | 2025-03-20 | 5 | Leah Copeland | South Todd | Central | NaN | MediaTek Dimensity | 8GB | 256GB | NaN |


## 2.5 get info about features

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   product_name       50000 non-null  object
 1   brand_name         50000 non-null  object
 2   product_id         50000 non-null  object
 3   product_specs      50000 non-null  object
 4   price              50000 non-null  int64 
 5   received_date      50000 non-null  object
 6   shipped_date       50000 non-null  object
 7   units_sold         50000 non-null  int64 
 8   customer_name      50000 non-null  object
 9   customer_city      50000 non-null  object
 10  sales_region       50000 non-null  object
 11  core_specs         25017 non-null  object
 12  processor_details  50000 non-null  object
 13  ram_size           50000 non-null  object
 14  rom_size           50000 non-null  object
 15  ssd_size           25017 non-null  object
dtypes: int64(2), object(14)
memory usage: 6.
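
The `info()` output shows that *core_specs* and *ssd_size* each have 25017 non-null values, so roughly half of the rows are missing them. One way to check whether those gaps line up with the product type is to count missing values per column within each *product_name* group. The sketch below uses a small hand-built frame copied from the three rows shown in 2.4 instead of the full CSV, so it runs on its own; `sample` and `missing_by_product` are names introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Three rows copied from the df.head(n=3) output in 2.4 (relevant columns only).
sample = pd.DataFrame({
    "product_name": ["Mobile Phone", "Laptop", "Mobile Phone"],
    "core_specs": [np.nan, "Ryzen 5", np.nan],
    "processor_details": ["Snapdragon 7 Gen", "Ryzen 5", "MediaTek Dimensity"],
    "ssd_size": [np.nan, "256GB", np.nan],
})

# Flag missing cells, then sum the flags per product type.
missing_by_product = (
    sample.drop(columns="product_name")
    .isna()
    .groupby(sample["product_name"])
    .sum()
)
print(missing_by_product)
```

Swapping `sample` for `df` should give the same breakdown for the full dataset.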

## 2.6 final analysis

1. **dataset**:
   - we've got 50000 rows and 16 columns.
   - currently only *price* and *units_sold* are numerical.
   - we have a regression task!

2. **about features (as described by the dataset's owner)**:
    - *product_name*: Type of product (Mobile Phone / Laptop).
    - *brand_name*: Various brands like Apple, Samsung, Dell, Lenovo, OnePlus, etc.
    - *product_id*: Unique identifier for each product.
    - *product_specs*: Brief description of the product features.
    - *price*: Cost of the product (randomly generated).
    - *received_date*: Date when the product was received in stock.
    - *shipped_date*: Date when the product was sold/dispatched.
    - *units_sold*: Number of units sold per transaction.
    - *customer_name*: Randomly generated customer names.
    - *customer_city*: City of the customer.
    - *sales_region*: Sales region (North, South, East, West, Central).
    - *core_specs* (For Laptops): Includes processor models like i3, i5, i7, i9, Ryzen 3-9.
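    - *processor_details* (For Mobiles): Includes processors like Snapdragon, Exynos, Apple A-Series, and MediaTek Dimensity. In the loaded data this column is filled for all 50000 rows, including laptops (for example, "Ryzen 5" in row 1).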
    - *ram_size*: Randomly assigned memory sizes (4GB to 32GB).
    - *rom_size*: Storage capacity (64GB to 1TB).
    - *ssd_size* (For Laptops): Additional storage (256GB to 2TB), "N/A" for mobile phones; after loading, these show up as missing values (25017 non-null in `df.info()`).
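
Since *ram_size*, *rom_size* and *ssd_size* are stored as strings such as "8GB" or "1TB", they would need to be converted before a model can use them as numbers. The following is a minimal sketch of one possible conversion, not the approach used later in this notebook. The helper `size_to_gb` is hypothetical, and treating 1TB as 1024GB is an assumption (the dataset description does not say which convention applies).

```python
import re

# Hypothetical helper: convert strings such as "8GB" or "1TB" to gigabytes.
_SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(GB|TB)\s*$", re.IGNORECASE)


def size_to_gb(value):
    """Return the size in GB as a float, or None if missing or unparseable."""
    if not isinstance(value, str):
        return None  # NaN / None coming from pandas end up here
    match = _SIZE_PATTERN.match(value)
    if match is None:
        return None
    number, unit = match.groups()
    factor = 1024 if unit.upper() == "TB" else 1  # assumption: 1TB = 1024GB
    return float(number) * factor


print(size_to_gb("12GB"))  # 12.0
print(size_to_gb("1TB"))   # 1024.0
print(size_to_gb(None))    # None
```

On the DataFrame, the helper could be applied with `df["ram_size"].map(size_to_gb)`, and the two date columns could be parsed in a similar spirit with `pd.to_datetime`.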