## Data Cleaning and Transformation

### 1. Handling Missing Data
Pandas provides several ways to handle missing data: detecting it with isna(), dropping it with dropna(), and filling it with fillna().

In [1]:
import pandas as pd

In [2]:
# Creating a DataFrame with missing data
data = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', None]
})

# Detecting missing data
print(data.isna())

# Dropping rows with missing data
print(data.dropna())

# Filling missing data (filling the numeric 'Age' column with a string turns it into an object column)
print(data.fillna('Unknown'))

    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False  False
3  False  False   True
    Name   Age      City
0   John  28.0  New York
2  Peter  35.0    Berlin
    Name      Age      City
0   John     28.0  New York
1   Anna  Unknown     Paris
2  Peter     35.0    Berlin
3  Linda     32.0   Unknown
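
Filling every column with the same string mixes text into the numeric 'Age' column. The sketch below shows one possible alternative on the same sample data: counting missing values per column, filling each column with a value of its own type, and dropping rows only when a chosen column is missing. Using the column mean for 'Age' is an illustrative assumption, not a recommendation from the original notebook.

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', None]
})

# Count missing values per column
missing_counts = data.isna().sum()
print(missing_counts)

# Fill each column with a value that matches its type
# (the mean for 'Age' is only an example choice)
filled = data.fillna({'Age': data['Age'].mean(), 'City': 'Unknown'})
print(filled)

# Drop a row only when 'Age' is missing; a missing 'City' is tolerated
age_known = data.dropna(subset=['Age'])
print(age_known)
```

Passing a dict to fillna() keeps 'Age' numeric, which matters for later arithmetic or type conversion.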


### 2. Detecting and Handling Duplicates
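duplicated() flags every row that repeats an earlier row, and drop_duplicates() removes those rows. The sample data contains no duplicates, so every flag below is False and the DataFrame comes back unchanged.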

In [3]:
# Detecting duplicates
print(data.duplicated())

# Removing duplicates
data_no_duplicates = data.drop_duplicates()
print(data_no_duplicates)

0    False
1    False
2    False
3    False
dtype: bool
    Name   Age      City
0   John  28.0  New York
1   Anna   NaN     Paris
2  Peter  35.0    Berlin
3  Linda  32.0      None
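
To see these methods do something visible, here is a small sketch on a made-up DataFrame (`people` is a hypothetical name and dataset, not part of the notebook) that does contain a repeated row. It also shows the `subset` and `keep` parameters, which restrict the comparison to chosen columns and select which copy survives.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first
people = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna'],
    'City': ['New York', 'Paris', 'New York', 'Rome']
})

# True marks a row that is identical to an earlier row
flags = people.duplicated()
print(flags)

# Remove fully identical rows (the first occurrence is kept by default)
deduped = people.drop_duplicates()
print(deduped)

# Compare on 'Name' only and keep the last occurrence of each name
by_name = people.drop_duplicates(subset=['Name'], keep='last')
print(by_name)
```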


### 3. Data Type Conversion (astype())
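You can convert the data type of a column using the astype() method. A column that contains NaN cannot be cast to a plain integer type, so missing values are filled first. The integer width shown in the output (int32 or int64) depends on the platform and NumPy version.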

In [4]:
# Converting 'Age' column to integer (after filling NaN)
data['Age'] = data['Age'].fillna(0).astype(int)
print(data.dtypes)

Name    object
Age      int32
City    object
dtype: object
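
Filling missing ages with 0 changes the data. If that is undesirable, one option is pandas' nullable integer dtype, sketched below under the assumption that the missing entries should stay missing. The second part shows pd.to_numeric() with errors='coerce' for text columns; the `raw` Series is a hypothetical example.

```python
import pandas as pd

# Same ages as in the sample data; None is stored as NaN in a float64 Series
ages = pd.Series([28, None, 35, 32], name='Age')

# 'Int64' (capital I) is the nullable integer dtype: missing stays missing
nullable_ages = ages.astype('Int64')
print(nullable_ages)

# Hypothetical text column: unparseable entries become NaN instead of raising
raw = pd.Series(['28', 'n/a', '35'])
parsed = pd.to_numeric(raw, errors='coerce')
print(parsed)
```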


### 4. Renaming Columns and Indexes
Rename columns or indexes using rename().

In [5]:
# Renaming a column
data = data.rename(columns={'Name': 'Full Name'})
print(data)

# Renaming index
data = data.rename(index={0: 'Row1', 1: 'Row2'})
print(data)

  Full Name  Age      City
0      John   28  New York
1      Anna    0     Paris
2     Peter   35    Berlin
3     Linda   32      None
     Full Name  Age      City
Row1      John   28  New York
Row2      Anna    0     Paris
2        Peter   35    Berlin
3        Linda   32      None
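
rename() also accepts a function instead of a dict, which can be convenient when every label needs the same treatment. The following is a minimal sketch on a small stand-in DataFrame (`df` is a hypothetical name); it also shows that labels absent from the mapping are ignored by default, and that set_index() is an alternative when the new index should come from a column.

```python
import pandas as pd

df = pd.DataFrame({'Full Name': ['John', 'Anna'], 'Age': [28, 24]})

# A callable is applied to every column label
snake = df.rename(columns=lambda c: c.lower().replace(' ', '_'))
print(snake.columns.tolist())

# Labels that are not in the DataFrame are silently skipped by default
unchanged = df.rename(columns={'Missing': 'X'})
print(unchanged.columns.tolist())

# Use an existing column as the index instead of renaming labels one by one
indexed = df.set_index('Full Name')
print(indexed)
```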


### 5. Applying Functions (apply(), map(), applymap())
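You can apply functions using apply() (along the rows or columns of a DataFrame, or to each value of a Series), map() (element-wise on a Series, also accepting a dict), and applymap() (element-wise on a DataFrame). Since pandas 2.1, applymap() is deprecated in favor of DataFrame.map().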

In [6]:
# Applying a function to a column using apply()
data['Age'] = data['Age'].apply(lambda x: x + 1)
print(data)

# Using map() on a Series with a dict; values missing from the dict (here None) become NaN
data['City'] = data['City'].map({'New York': 'NY', 'Paris': 'FR', 'Berlin': 'DE'})
print(data)

# Using applymap() for element-wise operations on a DataFrame (named DataFrame.map() in pandas >= 2.1)
data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(data.applymap(lambda x: x * 2))

     Full Name  Age      City
Row1      John   29  New York
Row2      Anna    1     Paris
2        Peter   36    Berlin
3        Linda   33      None
     Full Name  Age City
Row1      John   29   NY
Row2      Anna    1   FR
2        Peter   36   DE
3        Linda   33  NaN
   A   B
0  2   8
1  4  10
2  6  12
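
The cell above only uses apply() on a single column. As a rough sketch, the code below shows apply() along both axes of a DataFrame, plus a small helper for element-wise work that runs on old and new pandas versions. The helper name `elementwise` and the DataFrame `numbers` are hypothetical, introduced only for illustration.

```python
import pandas as pd

def elementwise(df, func):
    """Apply func to every cell, using whichever method this pandas version offers."""
    # DataFrame.map was added in pandas 2.1; older versions only have applymap
    if hasattr(pd.DataFrame, 'map'):
        return df.map(func)
    return df.applymap(func)

numbers = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Element-wise: every cell is doubled
doubled = elementwise(numbers, lambda x: x * 2)
print(doubled)

# axis=1: the function receives one row at a time
row_totals = numbers.apply(lambda row: row['A'] + row['B'], axis=1)
print(row_totals)

# Default axis=0: the function receives one column at a time
col_range = numbers.apply(lambda col: col.max() - col.min())
print(col_range)
```

For simple arithmetic like this, vectorized expressions such as `numbers * 2` are usually faster than any of the apply-style methods.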



### 6. Removing Unwanted Data (Filtering)
You can filter out unwanted rows or columns based on conditions.

In [7]:
# Creating a DataFrame (ensure 'Age' column is present)
data = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
})

# Check the columns in the DataFrame to verify 'Age' is present
print("Columns in DataFrame:", data.columns)

# Keeping only rows where 'Age' is at least 30 (boolean mask)
filtered_data = data[data['Age'] >= 30]
print(filtered_data)


Columns in DataFrame: Index(['Name', 'Age', 'City'], dtype='object')
    Name  Age    City
2  Peter   35  Berlin
3  Linda   32  London
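
Filters often involve more than one condition, and the text also mentions removing columns. The sketch below, on the same sample data, shows a few common variants: combining masks with `&`, matching against a list with isin(), the string-based query() method, and dropping a column with drop(). The result variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
over_30_not_berlin = data[(data['Age'] >= 30) & (data['City'] != 'Berlin')]
print(over_30_not_berlin)

# Keep rows whose value appears in a list
in_cities = data[data['City'].isin(['Paris', 'London'])]
print(in_cities)

# The same kind of filter written as a query string
under_30 = data.query('Age < 30')
print(under_30)

# Remove an unwanted column
no_city = data.drop(columns=['City'])
print(no_city)
```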


### 7. Changing DataFrame Layout (Pivot, Melt, Stack, Unstack)
You can reshape the DataFrame using pivot(), melt(), stack(), and unstack().

In [8]:
# Creating a new DataFrame for layout transformations
df_layout = pd.DataFrame({
    'Name': ['John', 'Anna', 'John', 'Anna'],
    'Year': [2021, 2021, 2022, 2022],
    'Sales': [500, 700, 600, 800]
})

# Pivoting the DataFrame
pivoted = df_layout.pivot(index='Name', columns='Year', values='Sales')
print(pivoted)

# Melting the DataFrame (reverse of pivot)
melted = pd.melt(pivoted.reset_index(), id_vars='Name', value_vars=[2021, 2022])
print(melted)

# Stacking and Unstacking
stacked = pivoted.stack()
print(stacked)

unstacked = stacked.unstack()
print(unstacked)

Year  2021  2022
Name            
Anna   700   800
John   500   600
   Name  Year  value
0  Anna  2021    700
1  John  2021    500
2  Anna  2022    800
3  John  2022    600
Name  Year
Anna  2021    700
      2022    800
John  2021    500
      2022    600
dtype: int64
Year  2021  2022
Name            
Anna   700   800
John   500   600
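
pivot() requires each index/column pair to appear only once. When the data has repeated pairs, pivot_table() with an aggregation function is one way around that. The following sketch uses a made-up `sales` DataFrame (not from the notebook) in which John has two entries for 2021, and then melts the result back into long form with explicit column names.

```python
import pandas as pd

# Hypothetical data: ('John', 2021) appears twice
sales = pd.DataFrame({
    'Name': ['John', 'John', 'Anna', 'Anna'],
    'Year': [2021, 2021, 2021, 2022],
    'Sales': [500, 100, 700, 800]
})

# pivot() cannot decide which duplicate to keep and raises ValueError
try:
    sales.pivot(index='Name', columns='Year', values='Sales')
except ValueError as exc:
    print('pivot failed:', exc)

# pivot_table() aggregates duplicates; missing combinations become NaN
summed = sales.pivot_table(index='Name', columns='Year', values='Sales', aggfunc='sum')
print(summed)

# Back to long form, naming the generated columns explicitly
long_form = summed.reset_index().melt(id_vars='Name', var_name='Year', value_name='Sales')
print(long_form)
```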


### 8. Replacing Values
Replace specific values in the DataFrame using the replace() method.

In [9]:
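# Replacing abbreviations in the 'City' column; values not in the mapping are left as they are.
# 'data' was recreated in section 6 with full city names, so nothing matches here and the output is unchanged.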
data['City'] = data['City'].replace({'NY': 'New York', 'FR': 'Paris', 'DE': 'Berlin'})
print(data)

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
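
Because the DataFrame above holds no abbreviations, the effect of replace() is not visible in that output. A minimal sketch on a stand-alone Series (`cities` and `mapping` are hypothetical names) makes it visible, and contrasts it with map(), which turns unmatched values into NaN instead of keeping them.

```python
import pandas as pd

cities = pd.Series(['NY', 'FR', 'DE', 'London'])
mapping = {'NY': 'New York', 'FR': 'Paris', 'DE': 'Berlin'}

# replace(): values without a mapping entry ('London') are kept
replaced = cities.replace(mapping)
print(replaced)

# map(): values without a mapping entry become NaN
mapped = cities.map(mapping)
print(mapped)
```

This difference is a common reason to prefer replace() for partial recoding and map() when every value is expected to have an entry.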
