# Data Cleaning and Preparation

Data cleaning and preparation is a crucial step in the data analysis process. Raw data is often messy: it contains missing values, outliers, duplicates, and other inconsistencies that can skew your analysis. This process, often referred to as "data munging," can take up a significant portion of a data scientist's time, but it is essential for accurate analysis and predictions. Fortunately, libraries such as pandas provide many tools that make this process easier. 🐼🐍

In [22]:
# Build a small example DataFrame with missing values and one duplicate row
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 1, 1],
                   'B': [5, np.nan, np.nan, 1, 1],
                   'C': [1, 2, 3, 1, 1]})
print(df)

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
3  1.0  1.0  1
4  1.0  1.0  1


___

### Understanding the Data

In [4]:
# View the first few rows of the dataset
print(df.head())

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
3  1.0  1.0  1
4  1.0  1.0  1


In [5]:
# Check the summary statistics of the dataset
print(df.describe())

              A         B         C
count  4.000000  3.000000  5.000000
mean   1.250000  2.333333  1.600000
std    0.500000  2.309401  0.894427
min    1.000000  1.000000  1.000000
25%    1.000000  1.000000  1.000000
50%    1.000000  1.000000  1.000000
75%    1.250000  3.000000  2.000000
max    2.000000  5.000000  3.000000


In [6]:
# Check the data types of each column
print(df.dtypes)

A    float64
B    float64
C      int64
dtype: object
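
Before cleaning anything, it usually helps to know how much data is missing. The sketch below rebuilds the same example frame so that it runs on its own; the names `missing_per_column` and `share_incomplete` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same illustrative frame as the one built at the top of this section
df = pd.DataFrame({'A': [1, 2, np.nan, 1, 1],
                   'B': [5, np.nan, np.nan, 1, 1],
                   'C': [1, 2, 3, 1, 1]})

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Share of rows that contain at least one missing value
share_incomplete = df.isna().any(axis=1).mean()
print(share_incomplete)
```

Counts like these are a reasonable basis for deciding whether dropping rows is affordable or whether filling values is the better option.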


___

#### 🚫 Filtering Out Missing Data
Often, the simplest way to handle missing data is to exclude the affected entries. pandas provides the `dropna()` method for this purpose; by default it drops every row that contains at least one missing value:

In [19]:
# Drop every row of `df` that contains at least one missing value
df.dropna()

     A    B  C
0  1.0  5.0  1
3  1.0  1.0  1
4  1.0  1.0  1
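
Dropping every incomplete row can be too aggressive. As a minimal sketch, assuming the same example frame, the `how`, `subset`, `thresh`, and `axis` parameters of `dropna()` narrow down what gets removed. The variable names here are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 1, 1],
                   'B': [5, np.nan, np.nan, 1, 1],
                   'C': [1, 2, 3, 1, 1]})

# Drop a row only if *all* of its values are missing (no such row here)
drop_if_all_missing = df.dropna(how='all')

# Only look at column A when deciding which rows to drop
drop_on_a = df.dropna(subset=['A'])

# Keep rows that have at least 2 non-missing values
keep_mostly_complete = df.dropna(thresh=2)

# Drop columns (instead of rows) that contain any missing value
complete_columns = df.dropna(axis=1)

print(drop_on_a)
print(complete_columns.columns.tolist())
```

None of these calls modifies `df`; each returns a new DataFrame.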


#### 🔄 Filling In Missing Data
However, deleting missing values is not always the best approach, especially if it means losing a lot of data. In such cases, it may be better to impute the missing values, i.e., fill them in based on the other values in the dataset. The `fillna()` method helps us accomplish this:



In [24]:
# A string placeholder is easy to spot, but it turns columns A and B into object dtype
df.fillna(value='FILL VALUE')

            A           B  C
0         1.0         5.0  1
1         2.0  FILL VALUE  2
2  FILL VALUE  FILL VALUE  3
3         1.0         1.0  1
4         1.0         1.0  1
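
For numeric columns, filling with a statistic or a neighbouring value usually keeps the data easier to analyse than a text placeholder. The following is a rough sketch on the same example frame; which strategy is appropriate depends on the data, and the choice of mean, median, and forward fill here is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 1, 1],
                   'B': [5, np.nan, np.nan, 1, 1],
                   'C': [1, 2, 3, 1, 1]})

# Fill each column's gaps with that column's mean
filled_mean = df.fillna(df.mean())

# Use a different fill value per column by passing a dict
filled_per_column = df.fillna({'A': 0, 'B': df['B'].median()})

# Carry the last valid observation forward (often used for ordered data)
filled_forward = df.ffill()

print(filled_mean)
print(filled_per_column)
print(filled_forward)
```

Forward filling leaves a gap in place when a column starts with a missing value, because there is no earlier observation to copy.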


#### 🚫 Removing Duplicates
Duplicate data can occur for a variety of reasons, most often as a result of errors during data collection. pandas provides the `duplicated()` and `drop_duplicates()` methods to help identify and remove duplicate rows:

In [23]:
# Assuming `df` is your DataFrame and it has duplicate rows
df.drop_duplicates()

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
3  1.0  1.0  1
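
To inspect duplicates before removing them, `duplicated()` returns a boolean mask, and both methods accept `subset` and `keep` to control what counts as a duplicate and which copy survives. This is a small sketch on the same example frame; the column choice `['A', 'C']` is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 1, 1],
                   'B': [5, np.nan, np.nan, 1, 1],
                   'C': [1, 2, 3, 1, 1]})

# True for every row that repeats an earlier row in full
dup_mask = df.duplicated()
print(dup_mask)

# Show the duplicated rows themselves
print(df[dup_mask])

# Treat rows as duplicates when A and C match, and keep the last occurrence
dedup_on_a_c = df.drop_duplicates(subset=['A', 'C'], keep='last')
print(dedup_on_a_c)
```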


#### ➡️ Replacing Values
Sometimes we need to replace values in a DataFrame. This might be because a value is erroneous, or because we want to standardize our data. The `replace()` method is designed for this:
```python
# Assuming `df` is your DataFrame and we want to replace 'old_value' with 'new_value'
df.replace('old_value', 'new_value')
```



In [25]:
ser = pd.Series([1., -999., 2., -999., -1000., 3.])
print(ser.replace(-999, np.nan))

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64
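
When several sentinel values are in play, `replace()` also accepts a list or a dict. The sketch below reuses the same series; treating `-1000` as a second sentinel, and mapping it to `0`, is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1., -999., 2., -999., -1000., 3.])

# Replace several sentinel values with NaN in one call
both_to_nan = ser.replace([-999, -1000], np.nan)
print(both_to_nan)

# Map each sentinel to its own replacement with a dict
mapped = ser.replace({-999: np.nan, -1000: 0})
print(mapped)
```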
