### Pandas Data Cleaning and Exploration

First, we import pandas under its conventional alias `pd`. It provides the DataFrame type used throughout this notebook.

In [2]:
import pandas as pd

We load the dataset from a CSV file into a pandas DataFrame so that the following cells can inspect and clean it.

In [3]:
df = pd.read_csv('missing_values_data.csv')
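
The contents of `missing_values_data.csv` are not shown here, so the following is only a sketch of how `read_csv` treats empty fields. It reads a small in-memory CSV whose column names match the ones used later in this notebook; the rows themselves are made up for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for 'missing_values_data.csv' (assumption).
csv_text = """Name,Age,Gender,City,Email,Join Date
alice smith,26,F,Delhi,alice@example.com,2021-01-15
bob jones,,M,New York,bob@example.com,
alice smith,26,F,Delhi,alice@example.com,2021-01-15
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # 'Age' is float64 because of the empty field
```

Empty fields are parsed as missing values, which is why an otherwise integer column such as 'Age' comes back as a float column.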

The `.head()` method displays the first five rows of the DataFrame by default (pass a number, e.g. `df.head(10)`, to change that). It is a quick way to inspect the data and understand its structure.

In [29]:
df.head()

Similarly, `.tail()` shows the last five rows of the DataFrame by default.

In [30]:
df.tail()

`.info()` prints a concise summary of the DataFrame: the number of rows, each column's data type and non-null count, and the memory usage.

In [6]:
df.info()

`.describe()` generates descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numerical columns.

In [7]:
df.describe()
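
As a small illustration on made-up data (the frame and its values are assumptions, not the notebook's dataset), the sketch below shows that `.describe()` skips non-numeric columns by default and that `include="all"` brings them in.

```python
import pandas as pd

# Hypothetical data for illustration only.
sample = pd.DataFrame({
    "Age": [20, 30, 40],
    "City": ["Delhi", "Delhi", "Mumbai"],
})

numeric_summary = sample.describe()            # numeric columns only
full_summary = sample.describe(include="all")  # adds count/unique/top/freq for text

print(numeric_summary)
print(full_summary)
```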

`.isna().sum()` counts the missing values in each column. `.isna()` marks every missing entry (`NaN`, `None`, or `NaT`) as `True`, and `.sum()` adds up those `True` values column by column.

In [8]:
df.isna().sum()
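
A minimal sketch of the same idea on a made-up frame (names and values are assumptions). Taking `.mean()` instead of `.sum()` gives the fraction of missing values per column, which is often easier to compare across columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
sample = pd.DataFrame({
    "Age": [26, np.nan, 31, np.nan],
    "City": ["Delhi", "New York", None, "Delhi"],
})

missing_counts = sample.isna().sum()   # number of missing values per column
missing_share = sample.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```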

`.duplicated()` helps in identifying duplicate rows in the DataFrame. It returns a boolean Series indicating whether each row is a duplicate of a previous one.

In [9]:
df.duplicated()
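
The sketch below, on made-up rows, shows how the `keep` parameter changes which rows are flagged, and how summing the mask gives the number of duplicates.

```python
import pandas as pd

# Hypothetical data for illustration only: rows 0 and 2 are identical.
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [26, 31, 26],
})

first_kept = sample.duplicated()            # default keep="first": flags later repeats
all_marked = sample.duplicated(keep=False)  # flags every row that has a twin
n_duplicates = first_kept.sum()             # True counts as 1

print(first_kept.tolist())   # [False, False, True]
print(all_marked.tolist())   # [True, False, True]
print(n_duplicates)          # 1
```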

Here, we remove the duplicate rows and create a new DataFrame `df2` to store the cleaned data.

In [10]:
df2 = df.drop_duplicates()
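
`drop_duplicates()` returns a new DataFrame and keeps the original row labels, so the index can have gaps afterwards. The sketch below uses made-up rows to show this, along with the `subset` and `keep` parameters for deduplicating on selected columns only.

```python
import pandas as pd

# Hypothetical data for illustration only.
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [26, 31, 26],
})

# Drop full-row duplicates, then renumber the index from 0.
deduped = sample.drop_duplicates().reset_index(drop=True)

# Deduplicate on 'Name' only, keeping the last occurrence of each name.
by_name = sample.drop_duplicates(subset=["Name"], keep="last")

print(len(sample), len(deduped))   # 3 2
print(by_name.index.tolist())      # [1, 2]
```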

Displaying the new DataFrame without the duplicate rows.

In [11]:
df2

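We fill the missing 'Age' values in `df2` with the mean of the original 'Age' column (computed on `df`, so duplicate rows still count toward it). The result is assigned back to the column: calling `fillna(..., inplace=True)` on a selection such as `df2['Age']` can act on a temporary copy and leave `df2` unchanged in recent pandas versions.

In [12]:
df2['Age'] = df2['Age'].fillna(df['Age'].mean())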
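
A minimal sketch of a mean fill written as an assignment, on a made-up column. Assigning the result back does not depend on how `inplace=True` behaves on a column selection, which varies across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
sample = pd.DataFrame({"Age": [20.0, np.nan, 30.0]})

mean_age = sample["Age"].mean()                 # NaN is skipped: (20 + 30) / 2
sample["Age"] = sample["Age"].fillna(mean_age)  # assign the filled column back

print(mean_age)                # 25.0
print(sample["Age"].tolist())  # [20.0, 25.0, 30.0]
```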

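Displaying the DataFrame after filling the missing 'Age' values. Notice that the values are floats: a numeric column with missing values is stored as `float64` when read from CSV, and the mean used for the fill is a float as well.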

In [35]:
df2

We fill the missing 'Join Date' values with the most frequent date. `.mode()` returns a Series and can contain several values if there is a tie, so we use `.mode()[0]` to select the first mode as the fill value.

In [15]:
df2['Join Date'] = df2['Join Date'].fillna(df['Join Date'].mode()[0])
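
To make the tie case concrete, here is a small sketch on made-up dates where two values are equally frequent. `.mode()` ignores missing entries by default and returns the tied values in sorted order, so `[0]` picks the smallest of them.

```python
import pandas as pd

# Hypothetical data for illustration only: two dates tie for most frequent.
dates = pd.Series(["2021-03-02", "2021-01-15", "2021-01-15", "2021-03-02", None])

modes = dates.mode()               # both tied values, sorted
fill_value = modes[0]              # first (smallest) mode
filled = dates.fillna(fill_value)

print(modes.tolist())   # ['2021-01-15', '2021-03-02']
print(filled.iloc[4])   # 2021-01-15
```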

Displaying the DataFrame after filling the missing 'Join Date' values.

In [16]:
df2

`.str.contains()` tests each string in a column against a pattern (a regular expression by default) and returns a boolean Series. Here, we check which rows in the 'City' column contain the string 'New york', ignoring case.

In [18]:
df['City'].str.contains('New york', case=False)
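
One detail worth illustrating is what happens to missing cities. Depending on the pandas version and column dtype, a missing entry can come back as `NaN` rather than `False`, which then cannot be used directly as a filter. The sketch below, on made-up values, passes `na=False` so that missing entries count as non-matches.

```python
import pandas as pd

# Hypothetical data for illustration only.
cities = pd.Series(["New York", "new york city", "Delhi", None])

# case=False ignores case; na=False treats missing values as "no match".
mask = cities.str.contains("New york", case=False, na=False)

print(mask.tolist())         # [True, True, False, False]
print(cities[mask].tolist()) # ['New York', 'new york city']
```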

`.str.lower()` returns the names in lowercase. It produces a new Series and leaves `df2` unchanged unless the result is assigned back to the column.

In [19]:
df2['Name'].str.lower()

`.str.upper()` does the same in uppercase.

In [20]:
df2['Name'].str.upper()

`.str.title()` capitalizes the first letter of each word in the 'Name' column.

In [21]:
df['Name'].str.title()
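
The three case methods can be compared side by side on a couple of made-up names. Each call returns a new Series, so the original values only change if the result is assigned back.

```python
import pandas as pd

# Hypothetical data for illustration only.
names = pd.Series(["alice SMITH", "Bob jones"])

lower = names.str.lower()
upper = names.str.upper()
title = names.str.title()   # first letter of each word upper, the rest lower

print(lower.tolist())   # ['alice smith', 'bob jones']
print(upper.tolist())   # ['ALICE SMITH', 'BOB JONES']
print(title.tolist())   # ['Alice Smith', 'Bob Jones']
print(names.tolist())   # unchanged: ['alice SMITH', 'Bob jones']
```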

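Now we clean the 'Age' column of the original `df`. First, we fill the missing values with the mean. Then, we round the values and convert the column to integer type to avoid floating-point numbers. The fill has to come first, because `astype(int)` raises an error if the column still contains `NaN`.

In [36]:
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Age'] = df['Age'].round().astype(int)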
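
The fill, round, and convert sequence can be sketched on a made-up Series as follows. The values are chosen so that the mean is not exactly halfway between two integers.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
ages = pd.Series([20.0, np.nan, 33.0, 26.0])

filled = ages.fillna(ages.mean())      # mean of 20, 33, 26 is about 26.33
rounded = filled.round().astype(int)   # safe now: no NaN left to convert

print(rounded.tolist())   # [20, 26, 33, 26]
print(rounded.dtype)      # an integer dtype
```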

Displaying the DataFrame after filling and converting the 'Age' column.

In [24]:
df

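This is a boolean indexing operation. It filters the DataFrame to show only the rows where the 'Age' is greater than 20 AND the 'Gender' is 'M'. Each condition is wrapped in parentheses because `&` binds more tightly than comparison operators such as `>` and `==`.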

In [25]:
df[(df['Age'] > 20) & (df['Gender'] == 'M')]
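
A small sketch of the same filter on made-up rows, with the mask built in a separate step so it can be inspected before it is applied.

```python
import pandas as pd

# Hypothetical data for illustration only.
sample = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carl", "Dev"],
    "Age": [26, 19, 34, 26],
    "Gender": ["F", "M", "M", "M"],
})

# Element-wise AND of two boolean Series; parentheses are required.
mask = (sample["Age"] > 20) & (sample["Gender"] == "M")
men_over_20 = sample[mask]

print(mask.tolist())                 # [False, False, True, True]
print(men_over_20["Name"].tolist())  # ['Carl', 'Dev']
```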

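To count the number of rows where the 'Age' is 26 AND the 'City' is 'Delhi', we wrap each boolean condition in parentheses, combine them with `&`, and call `.sum()` on the result. Each `True` counts as 1, so the sum is the number of matching rows.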

In [26]:
((df['Age'] == 26) & (df['City'] == 'Delhi')).sum()
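
On made-up rows, the count from `.sum()` agrees with the length of the filtered DataFrame, as the sketch below shows.

```python
import pandas as pd

# Hypothetical data for illustration only.
sample = pd.DataFrame({
    "Age": [26, 26, 30, 26],
    "City": ["Delhi", "Mumbai", "Delhi", "Delhi"],
})

mask = (sample["Age"] == 26) & (sample["City"] == "Delhi")
count = mask.sum()   # True counts as 1, False as 0

print(count)               # 2
print(len(sample[mask]))   # 2
```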

`.str.split('@')` splits each value in the 'Email' column at the '@' symbol and returns a Series of lists, which is useful for separating usernames from domain names.

In [28]:
df['Email'].str.split('@')
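
If the two parts are wanted as separate columns rather than as lists, `expand=True` is one option. The sketch below uses made-up addresses, and the column names `username` and `domain` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
emails = pd.Series(["alice@example.com", "bob@example.org"])

parts = emails.str.split("@")                # Series of lists
columns = emails.str.split("@", expand=True) # DataFrame, one column per piece
columns.columns = ["username", "domain"]     # illustrative names (assumption)

print(list(parts.iloc[0]))          # ['alice', 'example.com']
print(columns["domain"].tolist())   # ['example.com', 'example.org']
```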