# Understanding your data by asking basic questions

#### For any queries, you can reach me on LinkedIn: <a href="https://www.linkedin.com/in/sandeep-kumar-mahato/" target="_blank">Sandeep Kumar Mahato</a>

In [1]:
# Importing the pandas library
import pandas as pd

## Reading the `Samsung Mobiles Data.csv` dataset, which was extracted by web scraping

In [2]:
df = pd.read_csv('Samsung Mobiles Data.csv')

# 1. How big is the data?

In [3]:
df.shape

(384, 10)

### Dataset Shape (`df.shape`)

The dataset contains:

- **384 rows**: Each row represents a data record (e.g., a mobile phone).
- **10 columns**: Each column represents a feature or attribute of the records.
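
Because `shape` is a plain `(rows, columns)` tuple, it can be unpacked directly. The sketch below uses a tiny illustrative frame (named `sample` here, with values copied from the first rows shown later by `df.head()`) rather than the full dataset:

```python
import pandas as pd

# Illustrative three-row frame; not the full 384-row dataset.
sample = pd.DataFrame({
    "Product Name": ["SAMSUNG Galaxy A14 5G", "SAMSUNG Galaxy S23 5G", "SAMSUNG Galaxy F05"],
    "Price": [10999.0, 42999.0, 6499.0],
    "RAM (GB)": [6.0, 8.0, 4.0],
})

# shape is a (rows, columns) tuple, so it unpacks into two integers.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows x {n_cols} columns")
```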


# 2. What does the data look like?

In [4]:
df.head()

Unnamed: 0,Product Name,Colours,Price,Rating,Ratings Count,Reviews Count,RAM (GB),ROM (GB),Display (inch),Battery Capacity (mAh)
0,SAMSUNG Galaxy A14 5G,Dark Red,10999.0,4.2,63583.0,3046.0,6.0,128.0,6.6,5000
1,SAMSUNG Galaxy A14 5G,Light Green,10999.0,4.2,63583.0,3046.0,6.0,128.0,6.6,5000
2,SAMSUNG Galaxy S23 5G,Cream,42999.0,4.6,65179.0,4462.0,8.0,256.0,6.1,3900
3,SAMSUNG Galaxy F05,Twilight Blue,6499.0,4.2,21063.0,1148.0,4.0,64.0,6.74,5000
4,SAMSUNG Galaxy A14 5G,Black,10999.0,4.2,63583.0,3046.0,6.0,128.0,6.6,5000


In [6]:
df.sample(5)

Unnamed: 0,Product Name,Colours,Price,Rating,Ratings Count,Reviews Count,RAM (GB),ROM (GB),Display (inch),Battery Capacity (mAh)
129,SAMSUNG Galaxy F13,Waterfall Blue,14999.0,4.3,8875.0,11743.0,4.0,64.0,6.6,6000
148,SAMSUNG Galaxy A15 5G,Light Blue,14399.0,4.3,1564.0,73.0,6.0,128.0,6.5,5000
237,SAMSUNG Galaxy S23 5G,Phantom Black,42999.0,4.6,65179.0,4462.0,8.0,256.0,6.1,3900
272,SAMSUNG Galaxy Z Flip5,Mint,102499.0,4.4,755.0,45.0,8.0,512.0,6.7,3700
317,SAMSUNG Galaxy A14 5G,Light Green,10999.0,4.2,63583.0,3046.0,6.0,128.0,6.6,5000


# 3. What are the data types of the columns?

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384 entries, 0 to 383
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Product Name            384 non-null    object 
 1   Colours                 376 non-null    object 
 2   Price                   379 non-null    float64
 3   Rating                  380 non-null    float64
 4   Ratings Count           380 non-null    float64
 5   Reviews Count           380 non-null    float64
 6   RAM (GB)                376 non-null    float64
 7   ROM (GB)                376 non-null    float64
 8   Display (inch)          384 non-null    float64
 9   Battery Capacity (mAh)  384 non-null    int64  
dtypes: float64(7), int64(1), object(2)
memory usage: 30.1+ KB


### Dataset Overview (`df.info()`)

The `df.info()` method gives a concise summary of the dataset:

1. **Shape**: 384 rows and 10 columns.
2. **Non-null Counts**: Indicates missing values in some columns:
   - **Colours**: 376 non-null (8 missing).
   - **Price**: 379 non-null (5 missing).
   - **Rating**, **Ratings Count**, **Reviews Count**: Each with 380 non-null values (4 missing).
   - **RAM (GB)** and **ROM (GB)**: 376 non-null (8 missing each).
3. **Data Types**:
   - **object (2 columns)**: Text data for `Product Name` and `Colours`.
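   - **float64 (7 columns)**: Numeric features such as `Price` and `Rating`. The count columns (`Ratings Count`, `Reviews Count`) hold whole numbers but are stored as floats, most likely because they contain missing values.
   - **int64 (1 column)**: Integer data for `Battery Capacity (mAh)`.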

### Key Observations:
- Missing values are present in some columns and need to be handled during preprocessing.
- Numeric data dominates the dataset, which is useful for statistical and ML analyses.
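
A likely reason the count columns appear as `float64` is that pandas represents missing values as `NaN`, which is a float. As a minimal sketch (the frame `counts` is illustrative, and the position of the missing value is an assumption), such a column can be converted to pandas' nullable `Int64` dtype so it keeps whole numbers alongside missing entries:

```python
import numpy as np
import pandas as pd

# Illustrative column: two real values from the dataset plus one assumed missing entry.
counts = pd.DataFrame({"Ratings Count": [63583.0, 65179.0, np.nan]})
print(counts.dtypes)  # float64, because NaN is a float

# "Int64" (capital I) is the nullable integer dtype; missing values become <NA>.
counts_int = counts.astype({"Ratings Count": "Int64"})
print(counts_int.dtypes)
```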


# 4. Are there any missing values?

In [8]:
df.isnull().sum()

Product Name              0
Colours                   8
Price                     5
Rating                    4
Ratings Count             4
Reviews Count             4
RAM (GB)                  8
ROM (GB)                  8
Display (inch)            0
Battery Capacity (mAh)    0
dtype: int64

### Missing Values in the Dataset (`df.isnull().sum()`)

The missing value analysis reveals:

1. **No missing values** in:
   - Product Name
   - Display (inch)
   - Battery Capacity (mAh)

2. **Columns with missing values**:
   - **Colours**, **RAM (GB)**, **ROM (GB)**: 8 missing values each.
   - **Price**: 5 missing values.
   - **Rating**, **Ratings Count**, **Reviews Count**: 4 missing values each.

### Key Observations:
- Missing data exists in critical columns like **Price** and **RAM (GB)**, which are important for analysis.
- Handling these missing values (e.g., imputation or removal) will be necessary for further processing.
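
The two handling options mentioned above can be sketched as follows. The frame `sample` is illustrative and the positions of the missing values are assumptions; whether to drop or impute, and which statistic to impute with, depends on the analysis:

```python
import numpy as np
import pandas as pd

# Illustrative frame with assumed gaps in Price and RAM (GB).
sample = pd.DataFrame({
    "Price": [10999.0, np.nan, 42999.0, 6499.0],
    "RAM (GB)": [6.0, 8.0, np.nan, 4.0],
    "Display (inch)": [6.6, 6.1, 6.1, 6.74],
})

# Share of missing values per column, as a percentage.
missing_pct = sample.isnull().mean() * 100
print(missing_pct)

# Option 1 (removal): drop rows where Price is missing.
dropped = sample.dropna(subset=["Price"])

# Option 2 (imputation): fill missing RAM with the column median.
imputed = sample.fillna({"RAM (GB)": sample["RAM (GB)"].median()})
```

Both `dropna` and `fillna` return new DataFrames, so `sample` itself is left unchanged.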


# 5. How does the data look mathematically?

In [9]:
df.describe()

Unnamed: 0,Price,Rating,Ratings Count,Reviews Count,RAM (GB),ROM (GB),Display (inch),Battery Capacity (mAh)
count,379.0,380.0,380.0,380.0,376.0,376.0,384.0,384.0
mean,35108.356201,4.328421,31484.997368,2768.807895,7.140957,171.957447,6.340443,4546.729167
std,31215.593884,0.26904,30201.99725,4323.875042,2.156249,99.844777,0.724942,945.763879
min,1149.0,2.8,8.0,0.0,2.0,16.0,1.5,4.0
25%,12299.0,4.2,1028.0,68.75,6.0,128.0,6.1,3900.0
50%,30990.0,4.3,14975.0,2397.5,8.0,128.0,6.5,5000.0
75%,42999.0,4.6,65179.0,4462.0,8.0,256.0,6.6,5000.0
max,176999.0,4.8,68242.0,53939.0,12.0,512.0,7.6,7000.0


### Statistical Summary of the Dataset

The `df.describe()` function provides a quick overview of the numerical features in the dataset. Here is what we can infer:

1. **Price**:
   - Average: ₹35,108.36
   - Range: ₹1,149 to ₹1,76,999
   - The middle 50% of phones fall between ₹12,299 (Q1) and ₹42,999 (Q3).

2. **Rating**:
   - Average: 4.33
   - The middle 50% of ratings lie between 4.2 (Q1) and 4.6 (Q3), indicating good customer feedback.

3. **Ratings Count & Reviews Count**:
   - Ratings Count: Varies widely from 8 to 68,242.
   - Reviews Count: Ranges from 0 to 53,939, although 75% of phones have 4,462 reviews or fewer.

4. **RAM (GB)**:
   - Average: 7.14 GB
   - Range: 2 GB to 12 GB
   - The middle 50% of phones have 6-8 GB RAM (Q1 is 6 GB; the median and Q3 are both 8 GB).

5. **ROM (GB)**:
   - Average: 171.96 GB
   - Range: 16 GB to 512 GB
   - Common configurations: 128 GB (Q1, median) and 256 GB (Q3).

6. **Display Size (inch)**:
   - Average: 6.34 inches
   - Range: 1.5 inches (likely an outlier) to 7.6 inches.
   - Typical sizes: 6.1 to 6.6 inches.

7. **Battery Capacity (mAh)**:
   - Average: 4,546.73 mAh
   - Range: 4 mAh (outlier) to 7,000 mAh.
   - The middle 50% of phones have a capacity between 3,900 mAh (Q1) and 5,000 mAh (Q3).

### Key Observations:
- The dataset shows significant variability in features like Price, Ratings, and Reviews.
- Common configurations for phones include 6-8 GB RAM, 128-256 GB ROM, and a battery capacity of 3,900-5,000 mAh.
- Outliers are present, especially in Display Size and Battery Capacity, which may require further analysis.
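
One common way to follow up on suspected outliers is the 1.5 × IQR rule. This is a swapped-in illustration, not a step from the original notebook. The series `battery` below is illustrative, built from battery values that appear in the outputs above, including the 4 mAh minimum:

```python
import pandas as pd

# Illustrative battery capacities (mAh) taken from values shown in this notebook.
battery = pd.Series([5000, 5000, 3900, 5000, 6000, 3700, 4], name="Battery Capacity (mAh)")

q1 = battery.quantile(0.25)
q3 = battery.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = battery[(battery < lower) | (battery > upper)]
print(outliers)
```

Flagged values still need a manual check: a 4 mAh battery is implausible for a phone, whereas an unusually large but real value may be worth keeping.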


# 6. Are there duplicate values?

In [10]:
df.duplicated().sum()

158

### Duplicated Rows in the Dataset

The result of `df.duplicated().sum()` is **158**, meaning 158 rows are exact repeats of a row that appears earlier in the dataset (by default, the first occurrence is not counted as a duplicate). That is about 41% of the 384 rows, enough to affect the accuracy of analysis or models.

### Why It Matters:
- Duplicate rows can introduce redundancy and bias into your analysis.
- It is essential to handle duplicates to maintain data integrity.

You can remove these duplicates with `df.drop_duplicates()` if needed. It returns a new DataFrame, so assign the result back (for example, `df = df.drop_duplicates()`).
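
A minimal sketch of counting and removing duplicates, using an illustrative frame `sample` in which the "Light Green" Galaxy A14 5G row is repeated, as it is in the outputs above:

```python
import pandas as pd

# Illustrative frame: rows 0 and 1 are identical.
sample = pd.DataFrame({
    "Product Name": ["SAMSUNG Galaxy A14 5G", "SAMSUNG Galaxy A14 5G",
                     "SAMSUNG Galaxy S23 5G", "SAMSUNG Galaxy A14 5G"],
    "Colours": ["Light Green", "Light Green", "Cream", "Black"],
    "Price": [10999.0, 10999.0, 42999.0, 10999.0],
})

# Default keep="first": only the later repeats are counted.
n_dupes = sample.duplicated().sum()

# keep=False marks every row that has an identical twin, including the first one.
n_involved = sample.duplicated(keep=False).sum()

# drop_duplicates returns a new frame; reset_index tidies the row labels.
deduped = sample.drop_duplicates().reset_index(drop=True)
print(n_dupes, n_involved, deduped.shape)
```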


# 7. How are the columns correlated?

- **Drop Non-Numeric Columns**: The code below creates a new DataFrame `df_numeric` that only includes numeric columns (float and int types) using `select_dtypes()`.

In [20]:
# Dropping non-numeric columns before calculating correlation
df_numeric = df.select_dtypes(include=['float64', 'int64'])

In [21]:
df_numeric.corr()

Unnamed: 0,Price,Rating,Ratings Count,Reviews Count,RAM (GB),ROM (GB),Display (inch),Battery Capacity (mAh)
Price,1.0,0.368657,-0.148168,-0.118029,0.760729,0.803266,0.228957,-0.157649
Rating,0.368657,1.0,0.453763,0.210701,0.444346,0.300371,0.286748,-0.010269
Ratings Count,-0.148168,0.453763,1.0,0.433752,0.042014,-0.111023,-0.112435,-0.254514
Reviews Count,-0.118029,0.210701,0.433752,1.0,-0.11479,-0.153943,-0.074839,-0.048284
RAM (GB),0.760729,0.444346,0.042014,-0.11479,1.0,0.708975,0.104326,-0.282273
ROM (GB),0.803266,0.300371,-0.111023,-0.153943,0.708975,1.0,0.208889,-0.220629
Display (inch),0.228957,0.286748,-0.112435,-0.074839,0.104326,0.208889,1.0,0.77981
Battery Capacity (mAh),-0.157649,-0.010269,-0.254514,-0.048284,-0.282273,-0.220629,0.77981,1.0


In [22]:
df_numeric.corr()['Price']

Price                     1.000000
Rating                    0.368657
Ratings Count            -0.148168
Reviews Count            -0.118029
RAM (GB)                  0.760729
ROM (GB)                  0.803266
Display (inch)            0.228957
Battery Capacity (mAh)   -0.157649
Name: Price, dtype: float64

### Understanding `df_numeric.corr()['Price']` Result

The correlation values measure how strongly each feature is linearly related to the `Price` column in the dataset. Correlation values range from -1 to 1:  

- **1** indicates a perfect positive linear relationship.  
- **-1** indicates a perfect negative linear relationship.  
- **0** means no linear relationship.  
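
By default, `DataFrame.corr()` computes the Pearson coefficient: the covariance of two columns divided by the product of their standard deviations. The sketch below recomputes it by hand for four illustrative (RAM, Price) pairs taken from rows shown earlier and compares the result with pandas; the variable names are hypothetical:

```python
import numpy as np
import pandas as pd

# Four (RAM, Price) pairs from rows displayed earlier in this notebook.
ram = pd.Series([6.0, 8.0, 4.0, 8.0])
price = pd.Series([10999.0, 42999.0, 6499.0, 102499.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean.
dx = ram - ram.mean()
dy = price - price.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas' built-in Pearson correlation for comparison.
r_pandas = ram.corr(price)
print(r_manual, r_pandas)
```

With only four points the value itself is not meaningful; the point is that the manual formula and `Series.corr` agree.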

Here’s what the results suggest:  

1. **`RAM (GB)` (0.760729)** and **`ROM (GB)` (0.803266)**:  
   - Strong positive correlation with `Price`.  
   - Indicates that as RAM or ROM increases, the price of the mobile phone also tends to increase.  

2. **`Rating` (0.368657)**:  
   - Weak positive correlation with `Price`.  
   - Higher-rated phones may be slightly more expensive, but the relationship isn’t strong.  

3. **`Display (inch)` (0.228957)**:  
   - Weak positive correlation with `Price`.  
   - Larger screen size might contribute to a higher price, but it’s not a major factor.  

4. **`Ratings Count` (-0.148168)** and **`Reviews Count` (-0.118029)**:  
   - Weak negative correlation with `Price`.  
   - Suggests that phones with more ratings and reviews might have slightly lower prices, perhaps due to higher availability or popularity among budget buyers.  

5. **`Battery Capacity (mAh)` (-0.157649)**:  
   - Weak negative correlation with `Price`.  
   - Indicates that higher battery capacity isn’t a strong determinant of price, and might even be slightly inversely related.  

---

#### Correlation Analysis with Price  

The correlation values between different features and the `Price` column are as follows:  

| Feature                    | Correlation with Price | Interpretation |
|----------------------------|------------------------|----------------|
| **RAM (GB)**               | 0.7607                 | Strong positive correlation. Higher RAM is associated with a higher price. |
| **ROM (GB)**               | 0.8033                 | Strong positive correlation. Higher ROM is associated with a higher price. |
| **Rating**                 | 0.3687                 | Weak positive correlation. Higher-rated phones tend to be slightly costlier. |
| **Display (inch)**         | 0.2290                 | Weak positive correlation. Larger screens are only loosely associated with a higher price. |
| **Ratings Count**          | -0.1482                | Weak negative correlation. Phones with more ratings tend to be slightly cheaper. |
| **Reviews Count**          | -0.1180                | Weak negative correlation. Phones with more reviews tend to be slightly cheaper. |
| **Battery Capacity (mAh)** | -0.1576                | Weak negative correlation. Higher battery capacity does not go with a higher price. |

**Conclusion**:  
The price of mobile phones is most strongly correlated with **RAM** and **ROM**, while **Rating**, **Display (inch)**, **Battery Capacity**, **Ratings Count**, and **Reviews Count** show much weaker linear relationships. Correlation does not establish causation, but these values can guide feature selection for building machine learning models.
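
When using these values for feature selection, it can help to rank features by the absolute size of the correlation, since a strong negative relationship is as informative as a strong positive one. The sketch below hard-codes the values printed above into a Series (named `price_corr` here for illustration) instead of recomputing them:

```python
import pandas as pd

# Correlations with Price, copied from the df_numeric.corr()['Price'] output above.
price_corr = pd.Series({
    "Rating": 0.368657,
    "Ratings Count": -0.148168,
    "Reviews Count": -0.118029,
    "RAM (GB)": 0.760729,
    "ROM (GB)": 0.803266,
    "Display (inch)": 0.228957,
    "Battery Capacity (mAh)": -0.157649,
})

# Order by magnitude, but keep the original signs in the result.
order = price_corr.abs().sort_values(ascending=False).index
ranked = price_corr.reindex(order)
print(ranked)
```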