Theoretical Assignment:

Q.1

Essay: Importance of Data Cleaning in Data Science
Introduction
Data cleaning, also known as data cleansing and usually treated as part of data preprocessing, is a critical step in the data science workflow. It involves detecting and correcting (or removing) corrupt, inaccurate, incomplete, or irrelevant parts of the data. Clean data leads to better model performance, more accurate insights, and more trustworthy results.

Why Data Cleaning Is Important

Improves Data Quality: Raw data often contains missing values, outliers, or errors. Cleaning ensures the data is consistent and accurate.

Enhances Model Performance: Machine learning algorithms rely on high-quality data. Dirty data can reduce model accuracy or introduce bias.

Reduces Noise and Redundancy: Cleaning eliminates unnecessary variables or duplicates, helping models focus on meaningful features.

Prevents Misleading Insights: Clean data makes correct conclusions more likely, reducing the risk of making decisions based on flawed analysis.

Saves Time in the Long Run: Investing time in data cleaning early prevents downstream issues during analysis and modeling.

Common Data Cleaning Tasks

Handling missing values (e.g., imputation, deletion)

Removing duplicates

Correcting inconsistent formats

Filtering out outliers

Normalizing or standardizing data
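
The tasks listed above can be sketched in pandas roughly as follows. This is a minimal illustration on a made-up five-row table, not a general recipe: the column names, the values, the median imputation and the 1.5 * IQR outlier rule are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
raw = pd.DataFrame({
    "city": ["Pune", "pune ", "Delhi", "Delhi", "Mumbai"],
    "price": [100.0, 100.0, np.nan, 120.0, 5000.0],
})

# Correct inconsistent formats: trim whitespace and unify capitalisation.
clean = raw.assign(city=raw["city"].str.strip().str.title())

# Remove exact duplicate rows (rows 0 and 1 match once the text is cleaned).
clean = clean.drop_duplicates()

# Handle missing values: impute with the median, which is less
# sensitive to extreme values than the mean.
clean["price"] = clean["price"].fillna(clean["price"].median())

# Filter outliers with the 1.5 * IQR rule (one common convention).
q1, q3 = clean["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].copy()

# Standardize: subtract the mean and divide by the standard deviation.
clean["price_z"] = (clean["price"] - clean["price"].mean()) / clean["price"].std(ddof=0)

print(clean)
```

The order of the steps matters in this sketch: the text is normalized before duplicates are dropped, otherwise "Pune" and "pune " would be counted as different rows.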

Conclusion
Data cleaning is often seen as tedious, but it is one of the most essential tasks in data science. Without it, even the most advanced algorithms or sophisticated models will not yield reliable results. Clean data is the foundation of good data science.

Q.2

Presentation: Data Visualization Techniques
Slide 1: Title Slide

Title: Data Visualization Techniques

Subtitle: Turning Data Into Insight

Presenter’s Name & Date

Slide 2: Introduction

Definition of Data Visualization

Importance: Understand patterns, trends, and outliers

Slide 3: Types of Data Visualization

Categorical: Bar charts, pie charts

Quantitative: Histograms, line graphs

Relational: Scatter plots, bubble charts

Geographical: Maps, choropleths

Temporal: Time series plots

Slide 4: Tools for Data Visualization

Python Libraries: Matplotlib, Seaborn, Plotly

BI Tools: Tableau, Power BI

Web-Based: D3.js, Google Charts

Slide 5: Best Practices

Use the right chart for the data

Avoid clutter; keep it simple

Label axes and legends clearly

Use color meaningfully (not decoratively)

Slide 6: Advanced Techniques

Heatmaps for correlation

Interactive dashboards

Animated plots (e.g., with Plotly or Flourish)

Slide 7: Real-World Examples

COVID-19 dashboards

Stock market trends

Customer segmentation visuals

Slide 8: Conclusion

Effective visualizations = Better decisions

A picture is worth a thousand data points
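
As a small illustration of two chart types from Slide 3 and the labelling advice from Slide 5, the sketch below draws a histogram and a scatter plot with Matplotlib. The data is synthetic and the variable names (`horsepower`, `price`) are assumptions made for the example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, generated only for this illustration.
rng = np.random.default_rng(0)
horsepower = rng.normal(250, 60, 200)
price = 150 * horsepower + rng.normal(0, 5000, 200)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single quantitative variable.
ax_hist.hist(horsepower, bins=20, color="steelblue")
ax_hist.set_title("Distribution of horsepower")
ax_hist.set_xlabel("Horsepower")
ax_hist.set_ylabel("Number of cars")

# Scatter plot: relationship between two quantitative variables.
ax_scatter.scatter(horsepower, price, s=10, color="darkorange")
ax_scatter.set_title("Price vs. horsepower")
ax_scatter.set_xlabel("Horsepower")
ax_scatter.set_ylabel("Price")

fig.tight_layout()
```

Calling `fig.savefig("charts.png")` afterwards would write the figure to a file; in a notebook the figure can be displayed inline instead.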


Practical Assignment:

Q.1

In [142]:
import pandas as pd
import numpy as np

In [143]:
df = pd.read_csv('data.csv')

In [144]:
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


In [145]:
df.shape

(11914, 16)

In [146]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Make               11914 non-null  object 
 1   Model              11914 non-null  object 
 2   Year               11914 non-null  int64  
 3   Engine Fuel Type   11911 non-null  object 
 4   Engine HP          11845 non-null  float64
 5   Engine Cylinders   11884 non-null  float64
 6   Transmission Type  11914 non-null  object 
 7   Driven_Wheels      11914 non-null  object 
 8   Number of Doors    11908 non-null  float64
 9   Market Category    8172 non-null   object 
 10  Vehicle Size       11914 non-null  object 
 11  Vehicle Style      11914 non-null  object 
 12  highway MPG        11914 non-null  int64  
 13  city mpg           11914 non-null  int64  
 14  Popularity         11914 non-null  int64  
 15  MSRP               11914 non-null  int64  
dtypes: float64(3), int64(5), object(8)

In [147]:
df.isnull().sum()

Make                    0
Model                   0
Year                    0
Engine Fuel Type        3
Engine HP              69
Engine Cylinders       30
Transmission Type       0
Driven_Wheels           0
Number of Doors         6
Market Category      3742
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64
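
Raw counts are easier to judge as a share of the rows. In the output above, Market Category is missing in 3742 of 11914 rows, roughly 31%, while the other affected columns are each missing in under 1% of rows. A minimal sketch of that calculation, on a hypothetical four-row frame whose column names only mimic the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Engine HP": [300.0, np.nan, 230.0, 335.0],
    "Market Category": ["Luxury", None, None, "Performance"],
    "MSRP": [40650, 36350, 29450, 46135],
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

A column with a large missing share may be better handled with an explicit "Unknown" category than with mode imputation, but that is a modelling choice rather than a rule.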

In [148]:
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection;
# pandas warns that the chained inplace form stops working in pandas 3.0.
df['Engine Fuel Type'] = df['Engine Fuel Type'].fillna(df['Engine Fuel Type'].mode()[0])
df['Engine HP'] = df['Engine HP'].fillna(df['Engine HP'].mean())
# mode() returns a Series, so take [0] to get a scalar fill value.
df['Market Category'] = df['Market Category'].fillna(df['Market Category'].mode()[0])
df['Engine Cylinders'] = df['Engine Cylinders'].fillna(df['Engine Cylinders'].mean())
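
When imputing with the mode, the fill value needs to be a scalar. `Series.mode()` returns a Series, and `fillna` aligns a Series argument on index labels, so most missing entries would be left untouched. The sketch below shows the difference on a hypothetical four-element column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values are invented for illustration.
s = pd.Series(["Luxury", np.nan, "Luxury", np.nan], name="Market Category")

# mode() returns a Series (here one value at index label 0), not a scalar.
mode_series = s.mode()

# A Series passed to fillna is aligned on index labels, so only a NaN
# at label 0 could be filled; the NaNs at labels 1 and 3 remain.
partial = s.fillna(mode_series)

# Taking the first element gives a scalar that fills every NaN.
full = s.fillna(mode_series.iloc[0])

print(partial.isna().sum(), full.isna().sum())
```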

In [149]:
df.duplicated().sum()

np.int64(715)

In [150]:
df.drop_duplicates(inplace=True)
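
Given the counts above, dropping 715 duplicates from 11914 rows should leave 11199 rows. Note that `drop_duplicates` keeps the original index labels, so the index has gaps afterwards. A minimal sketch on a hypothetical four-row frame, with `reset_index(drop=True)` added as an optional extra step that is not part of the notebook:

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "BMW"],
    "Year": [2011, 2011, 2012, 2011],
})

# duplicated() marks repeats only, not the first occurrence of each row.
n_dupes = toy.duplicated().sum()

# Drop the repeats, then renumber the rows 0..n-1.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```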

In [151]:
df = pd.get_dummies(df, columns=['Make', 'Vehicle Style'], drop_first=True)
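
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical frame with three makes. The first category in sorted order is dropped and becomes the implicit baseline, represented by all remaining dummy columns being zero (or False, depending on the pandas version).

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Make": ["Audi", "BMW", "Ford", "BMW"],
    "Year": [2012, 2011, 2013, 2011],
})

# One-hot encode Make; "Audi" (first in sorted order) is dropped.
encoded = pd.get_dummies(toy, columns=["Make"], drop_first=True)

print(encoded.columns.tolist())
```

Dropping one level avoids perfectly collinear dummy columns, which mainly matters for linear models; tree-based models are usually indifferent to it.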

In [152]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df[['MSRP', 'Engine HP']] = scaler.fit_transform(df[['MSRP', 'Engine HP']])
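
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation (`ddof=0`). The sketch below reproduces that calculation by hand with pandas on the five MSRP values shown in the `df.head()` output; it is an illustration of the formula, not a replacement for the scaler.

```python
import pandas as pd

# The five MSRP values from the df.head() output above.
toy = pd.DataFrame({"MSRP": [46135.0, 40650.0, 36350.0, 29450.0, 34500.0]})

mean = toy["MSRP"].mean()
# Population standard deviation; note that pandas defaults to ddof=1.
std = toy["MSRP"].std(ddof=0)

# z-score: (value - mean) / standard deviation.
toy["MSRP_scaled"] = (toy["MSRP"] - mean) / std

print(toy)
```

After scaling, the column has mean 0 and population standard deviation 1, so values are expressed in standard deviations from the mean rather than in dollars.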

In [153]:
df.isnull().sum()

Model                               0
Year                                0
Engine Fuel Type                    0
Engine HP                           0
Engine Cylinders                    0
                                   ..
Vehicle Style_Passenger Minivan     0
Vehicle Style_Passenger Van         0
Vehicle Style_Regular Cab Pickup    0
Vehicle Style_Sedan                 0
Vehicle Style_Wagon                 0
Length: 76, dtype: int64