Theoretical Assignment:

Q.1

Essay: Importance of Data Cleaning in Data Science
Introduction
Data cleaning, also known as data cleansing and usually treated as part of data preprocessing, is a critical step in the data science workflow. It involves detecting and correcting (or removing) corrupt, inaccurate, incomplete, or irrelevant parts of the data. Clean data supports better model performance, more accurate insights, and more trustworthy results.

Why Data Cleaning Is Important

Improves Data Quality: Raw data often contains missing values, outliers, or errors. Cleaning ensures the data is consistent and accurate.

Enhances Model Performance: Machine learning algorithms rely on high-quality data. Dirty data can reduce model accuracy or introduce bias.

Reduces Noise and Redundancy: Cleaning eliminates unnecessary variables or duplicates, helping models focus on meaningful features.

Prevents Misleading Insights: Clean data supports correct conclusions and reduces the risk of making decisions based on flawed analysis.

Saves Time in the Long Run: Investing time in data cleaning early prevents downstream issues during analysis and modeling.

Common Data Cleaning Tasks

Handling missing values (e.g., imputation, deletion)

Removing duplicates

Correcting inconsistent formats

Filtering out outliers

Normalizing or standardizing data
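
The tasks listed above can be sketched in pandas as follows. This is a minimal illustration on a small hypothetical DataFrame: the column names `make` and `hp`, the values, the median imputation, and the 1.5 * IQR outlier rule are all assumptions chosen for the example, not a prescribed recipe.

```python
import pandas as pd

# Hypothetical toy data; column names and values are invented for illustration.
raw = pd.DataFrame({
    "make": ["BMW", "bmw ", "Audi", "Audi", "Ford", "Ford"],
    "hp": [300.0, 300.0, None, 220.0, 150.0, 5000.0],
})

# 1. Correct inconsistent formats: trim whitespace and unify case.
clean = raw.assign(make=raw["make"].str.strip().str.upper())

# 2. Handle missing values: impute the numeric column with its median.
clean["hp"] = clean["hp"].fillna(clean["hp"].median())

# 3. Remove exact duplicate rows.
clean = clean.drop_duplicates()

# 4. Filter outliers with the 1.5 * IQR rule (one common convention).
q1, q3 = clean["hp"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = clean[clean["hp"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 5. Standardize to zero mean and unit variance (z-score).
clean = clean.assign(hp_z=(clean["hp"] - clean["hp"].mean()) / clean["hp"].std())

print(clean)
```

The order of the steps matters: fixing formats first lets `drop_duplicates` recognize "BMW" and "bmw " as the same value.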

Conclusion
Data cleaning is often seen as tedious, but it is one of the most important tasks in data science. Without it, even the most advanced algorithms and sophisticated models will not yield reliable results. Clean data is the foundation of good data science.

Q.2

Presentation: Data Visualization Techniques
Slide 1: Title Slide

Title: Data Visualization Techniques

Subtitle: Turning Data Into Insight

Presenter’s Name & Date

Slide 2: Introduction

Definition of Data Visualization

Importance: Understand patterns, trends, and outliers

Slide 3: Types of Data Visualization

Categorical: Bar charts, pie charts

Quantitative: Histograms, line graphs

Relational: Scatter plots, bubble charts

Geographical: Maps, choropleths

Temporal: Time series plots

Slide 4: Tools for Data Visualization

Python Libraries: Matplotlib, Seaborn, Plotly

BI Tools: Tableau, Power BI

Web-Based: D3.js, Google Charts

Slide 5: Best Practices

Use the right chart for the data

Avoid clutter; keep it simple

Label axes and legends clearly

Use color meaningfully (not decoratively)

Slide 6: Advanced Techniques

Heatmaps for correlation

Interactive dashboards

Animated plots (e.g., with Plotly or Flourish)

Slide 7: Real-World Examples

COVID-19 dashboards

Stock market trends

Customer segmentation visuals

Slide 8: Conclusion

Effective visualizations = Better decisions

A picture is worth a thousand data points


Practical Assignment:

Q.1

In [201]:
import pandas as pd
import numpy as np

In [202]:
df = pd.read_csv('data.csv')

In [203]:
df.head()

,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


In [204]:
df.shape

(11914, 16)

In [205]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Make               11914 non-null  object 
 1   Model              11914 non-null  object 
 2   Year               11914 non-null  int64  
 3   Engine Fuel Type   11911 non-null  object 
 4   Engine HP          11845 non-null  float64
 5   Engine Cylinders   11884 non-null  float64
 6   Transmission Type  11914 non-null  object 
 7   Driven_Wheels      11914 non-null  object 
 8   Number of Doors    11908 non-null  float64
 9   Market Category    8172 non-null   object 
 10  Vehicle Size       11914 non-null  object 
 11  Vehicle Style      11914 non-null  object 
 12  highway MPG        11914 non-null  int64  
 13  city mpg           11914 non-null  int64  
 14  Popularity         11914 non-null  int64  
 15  MSRP               11914 non-null  int64  
dtypes: float64(3), int64(5), object(8)

In [206]:
df.isnull().sum()

Make                    0
Model                   0
Year                    0
Engine Fuel Type        3
Engine HP              69
Engine Cylinders       30
Transmission Type       0
Driven_Wheels           0
Number of Doors         6
Market Category      3742
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64
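
Raw counts are easier to judge as a share of the 11914 rows: Market Category is missing in about 31% of rows (3742 of 11914), while each of the other affected columns is missing in well under 1%. The sketch below shows one way to compute such a report; the small frame is hypothetical and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; values are invented for illustration.
toy = pd.DataFrame({
    "Engine HP": [335.0, np.nan, 230.0, 200.0],
    "Market Category": ["Luxury", np.nan, np.nan, "Performance"],
    "MSRP": [46135, 40650, 36350, 29450],
})

# The mean of a boolean mask is the fraction of True values, i.e. missing cells.
missing_pct = toy.isna().mean().mul(100).round(1)

# Show only columns that actually have gaps, largest share first.
report = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(report)
```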

In [207]:
# Drop rows missing any of these columns; each has well under 1% missing values
df = df.dropna(subset=['Engine Fuel Type','Number of Doors','Engine Cylinders','Engine HP'])
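
In the null count that follows, Market Category falls from 3742 to 3728 even though it is not listed in `subset`: 14 of the dropped rows also lacked a Market Category. The sketch below reproduces that effect on a small hypothetical frame (values invented, not taken from data.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; values are invented for illustration.
toy = pd.DataFrame({
    "Engine HP": [335.0, np.nan, 230.0, 200.0],
    "Number of Doors": [2.0, 4.0, np.nan, 4.0],
    "Market Category": ["Luxury", np.nan, "Performance", np.nan],
})

# Only the columns named in `subset` decide whether a row is dropped.
kept = toy.dropna(subset=["Engine HP", "Number of Doors"])

rows_dropped = len(toy) - len(kept)                  # rows 1 and 2 are removed
nulls_before = toy["Market Category"].isna().sum()   # counted on all rows
nulls_after = kept["Market Category"].isna().sum()   # counted on surviving rows
print(rows_dropped, nulls_before, nulls_after)
```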

In [208]:
df.isnull().sum()

Make                    0
Model                   0
Year                    0
Engine Fuel Type        0
Engine HP               0
Engine Cylinders        0
Transmission Type       0
Driven_Wheels           0
Number of Doors         0
Market Category      3728
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64

In [209]:
# Fill the remaining Market Category gaps with the column's most frequent value (mode)
df['Market Category'] = df['Market Category'].fillna(df['Market Category'].mode()[0])
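
Filling with the mode assigns the single most frequent label to all 3728 missing rows, roughly 3 in 10 rows of the dataset, which inflates that category's count. One possible alternative, shown here only as a sketch and not as the method used in this notebook, is an explicit placeholder label. The sample values and the label "Unknown" are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the category strings are invented for illustration.
s = pd.Series(["Luxury", "Crossover", "Crossover", np.nan, np.nan],
              name="Market Category")

# Approach used in the notebook: fill with the most frequent value.
by_mode = s.fillna(s.mode()[0])

# Alternative: keep missingness visible with an explicit placeholder label.
by_label = s.fillna("Unknown")

mode_counts = by_mode.value_counts()
label_counts = by_label.value_counts()
print(mode_counts, label_counts, sep="\n\n")
```

With the placeholder, the original category frequencies stay unchanged and the missing rows can still be analyzed as their own group.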

In [210]:
df.isnull().sum()

Make                 0
Model                0
Year                 0
Engine Fuel Type     0
Engine HP            0
Engine Cylinders     0
Transmission Type    0
Driven_Wheels        0
Number of Doors      0
Market Category      0
Vehicle Size         0
Vehicle Style        0
highway MPG          0
city mpg             0
Popularity           0
MSRP                 0
dtype: int64

In [211]:
df.duplicated().sum()

np.int64(715)

In [212]:
df = df.drop_duplicates()

In [213]:
df.duplicated().sum()

np.int64(0)
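
By default, `duplicated()` flags every repeat after the first occurrence and `drop_duplicates()` keeps that first occurrence. The surviving rows keep their original index labels, which is why the later `df.info()` output reports 11097 entries with labels running from 0 to 11913. The sketch below illustrates this on hypothetical rows; calling `reset_index(drop=True)` is an optional extra step, not something this notebook does.

```python
import pandas as pd

# Hypothetical rows; values are invented for illustration.
toy = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "BMW"],
    "Year": [2011, 2011, 2012, 2011],
})

n_dupes = toy.duplicated().sum()   # counts repeats after the first occurrence
deduped = toy.drop_duplicates()    # default keep="first"
old_index = list(deduped.index)    # original labels survive, with gaps

# Optional: renumber rows 0..n-1 so positions and labels agree again.
renumbered = deduped.reset_index(drop=True)
print(n_dupes, old_index, list(renumbered.index))
```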

In [214]:
df.isnull().sum()

Make                 0
Model                0
Year                 0
Engine Fuel Type     0
Engine HP            0
Engine Cylinders     0
Transmission Type    0
Driven_Wheels        0
Number of Doors      0
Market Category      0
Vehicle Size         0
Vehicle Style        0
highway MPG          0
city mpg             0
Popularity           0
MSRP                 0
dtype: int64

In [215]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11097 entries, 0 to 11913
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Make               11097 non-null  object 
 1   Model              11097 non-null  object 
 2   Year               11097 non-null  int64  
 3   Engine Fuel Type   11097 non-null  object 
 4   Engine HP          11097 non-null  float64
 5   Engine Cylinders   11097 non-null  float64
 6   Transmission Type  11097 non-null  object 
 7   Driven_Wheels      11097 non-null  object 
 8   Number of Doors    11097 non-null  float64
 9   Market Category    11097 non-null  object 
 10  Vehicle Size       11097 non-null  object 
 11  Vehicle Style      11097 non-null  object 
 12  highway MPG        11097 non-null  int64  
 13  city mpg           11097 non-null  int64  
 14  Popularity         11097 non-null  int64  
 15  MSRP               11097 non-null  int64  
dtypes: float64(3), int64(5), object(8)

In [216]:
cat_values = ['Make','Model','Engine Fuel Type','Transmission Type','Driven_Wheels','Market Category','Vehicle Size','Vehicle Style']

In [217]:
for i in cat_values:
    print(df[i].value_counts())
    print(" ")
    print(" ")

Make
Chevrolet        1075
Ford              812
Toyota            716
Volkswagen        564
Nissan            541
Dodge             529
GMC               482
Honda             431
Cadillac          396
Mazda             392
Mercedes-Benz     340
Suzuki            339
Infiniti          328
BMW               324
Audi              321
Volvo             266
Hyundai           259
Acura             246
Subaru            239
Kia               224
Mitsubishi        205
Lexus             202
Buick             190
Chrysler          187
Pontiac           181
Lincoln           152
Land Rover        139
Porsche           136
Oldsmobile        132
Saab              109
Aston Martin       91
Bentley            74
Plymouth           71
Ferrari            68
Scion              60
FIAT               59
Maserati           55
Lamborghini        52
Rolls-Royce        31
Lotus              28
HUMMER             17
Maybach            16
Alfa Romeo          5
McLaren             5
Genesis             3
Bugat
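
The hand-typed `cat_values` list matches the eight non-numeric columns in the `df.info()` output. As a sketch of an alternative, the same list can be derived from the dtypes so it stays in sync if columns change. The frame below is hypothetical: it reuses a few of the dataset's column names with invented values.

```python
import pandas as pd

# Hypothetical frame with a few of the dataset's column names; values are invented.
toy = pd.DataFrame({
    "Make": ["BMW", "Audi", "BMW"],
    "Transmission Type": ["MANUAL", "AUTOMATIC", "UNKNOWN"],
    "Year": [2011, 2012, 2011],
    "Engine HP": [335.0, 220.0, 300.0],
})

# Collect the non-numeric columns by dtype rather than typing each name.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Store the counts in a dict so each column can be inspected individually.
counts = {col: toy[col].value_counts() for col in cat_cols}

for col in cat_cols:
    print(counts[col], end="\n\n")
```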

In the Transmission Type column, drop the rows whose value is UNKNOWN.

In [218]:
df = df[df['Transmission Type'] != 'UNKNOWN']

In [219]:
df['Transmission Type'].value_counts()


Transmission Type
AUTOMATIC           7897
MANUAL              2621
AUTOMATED_MANUAL     552
DIRECT_DRIVE          15
Name: count, dtype: int64
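
As a consistency check, the four counts above sum to 7897 + 2621 + 552 + 15 = 11085, which is 12 fewer than the 11097 rows reported by `df.info()` before the filter, so 12 UNKNOWN rows were removed. The sketch below shows how the boolean-mask filter works on a hypothetical sample, together with one alternative (marking the sentinel as missing instead of dropping the row) that this notebook does not use.

```python
import pandas as pd

# Hypothetical sample of the column; values are invented for illustration.
toy = pd.DataFrame({
    "Transmission Type": ["AUTOMATIC", "UNKNOWN", "MANUAL", "UNKNOWN", "AUTOMATIC"],
})

# The comparison yields a boolean mask; indexing with it keeps only the True rows.
mask = toy["Transmission Type"] != "UNKNOWN"
filtered = toy[mask]
removed = len(toy) - len(filtered)

# Alternative: keep the rows but turn the sentinel into a missing value,
# so the usual isna()/fillna()/dropna() tools apply.
as_missing = toy["Transmission Type"].where(mask)
print(removed, as_missing.isna().sum())
```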