## Data cleaning and pre-processing
Data cleaning and preprocessing are essential steps in any data science project. This section relies mainly on the pandas library.
#### Missing values
Missing values are common in real datasets. They can be filled with a fixed value or with an aggregate such as the column mean or median. Alternatively, rows or columns that contain missing values can be dropped.

In [None]:
import pandas as pd
import numpy as np

# Creating a DataFrame with some missing values
df = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [4, 5, np.nan],
    'C': [7, 8, 9]
})

# Fill missing values with a specified value
df_filled = df.fillna(0)

# Fill missing values with mean of the column
df_filled_mean = df.fillna(df.mean())

# Drop rows with missing values
df_dropped = df.dropna()
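
The median fill and the column-wise drop mentioned in the text can be sketched in the same way. This is a minimal example that rebuilds the same toy frame; the names `missing_counts`, `df_filled_median`, and `df_dropped_cols` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [4, 5, np.nan],
    'C': [7, 8, 9]
})

# Count missing values per column before deciding how to handle them
missing_counts = df.isna().sum()

# Fill each column's gaps with that column's median
df_filled_median = df.fillna(df.median())

# Drop columns (rather than rows) that contain any missing value
df_dropped_cols = df.dropna(axis=1)
```

Checking `missing_counts` first is a reasonable habit: a column that is mostly empty is usually a better candidate for dropping than for filling.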

#### Duplicate values
Duplicate rows can distort analysis and prediction results, for example by giving repeated observations extra weight. Removing them helps keep both accurate.

In [None]:
# Creating a DataFrame with duplicate rows
df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'two'],
    'C': [1, 1, 2, 2]
})

# Dropping duplicate rows
df_dropped = df.drop_duplicates()
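
Before dropping anything, it can help to see how many rows are duplicated and to decide which columns define a duplicate. The sketch below uses a slightly modified toy frame (the last value in `'C'` differs) so that full-row and subset-based duplicates give different answers; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'two'],
    'C': [1, 1, 2, 3]
})

# Boolean mask: True for each row that exactly repeats an earlier row
dup_mask = df.duplicated()
n_duplicates = int(dup_mask.sum())

# Treat rows as duplicates when only 'A' and 'B' match,
# and keep the last occurrence instead of the first
df_last = df.drop_duplicates(subset=['A', 'B'], keep='last')
```

Here only one row is an exact duplicate, but restricting the comparison to `'A'` and `'B'` collapses the frame to two rows.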

#### Outliers
Outliers are extreme values that deviate significantly from the other observations. They can arise from natural variability in the data or from measurement errors. Statistical methods such as the IQR method or the Z-score method can be used to detect and remove them.

In [None]:
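# Example DataFrame with a numeric column 'data' containing one extreme value
df = pd.DataFrame({'data': [10, 12, 11, 13, 12, 11, 10, 100]})

# Quartiles and interquartile range (IQR)
Q1 = df['data'].quantile(0.25)
Q3 = df['data'].quantile(0.75)
IQR = Q3 - Q1

# Keep values within 1.5 * IQR of the quartiles
# (named `mask` to avoid shadowing the built-in `filter`)
mask = (df['data'] >= Q1 - 1.5 * IQR) & (df['data'] <= Q3 + 1.5 * IQR)

# Apply the mask to remove outliers
df_no_outlier = df.loc[mask]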
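
The Z-score method mentioned above can be sketched in a similar way. The sample values and the threshold are assumptions chosen for illustration. A cutoff of 3 is a common convention, but in this ten-row sample the extreme value only reaches a Z-score of about 2.84, so the sketch uses 2.5 instead.

```python
import pandas as pd

# Hypothetical sample: nine typical readings and one extreme value
df = pd.DataFrame({'data': [10, 12, 11, 13, 12, 11, 10, 12, 13, 100]})

# Z-score: distance from the mean in units of (sample) standard deviation
z_scores = (df['data'] - df['data'].mean()) / df['data'].std()

# Keep rows whose absolute Z-score is within the chosen threshold
threshold = 2.5
df_no_outlier = df[z_scores.abs() <= threshold]
```

Because the mean and standard deviation are themselves pulled by extreme values, the Z-score method tends to be less robust than the IQR method on small or heavily skewed samples.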

#### Normalization
Data normalization scales the data to a specific range, typically between 0 and 1. It is useful when the dataset contains features on very different scales, particularly for algorithms that are sensitive to feature magnitude.

In [None]:
from sklearn.preprocessing import MinMaxScaler

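# Example DataFrame with two features on very different scales
df = pd.DataFrame({'col1': [1.0, 5.0, 10.0], 'col2': [100.0, 500.0, 1000.0]})

# Scale each column to the [0, 1] range
scaler = MinMaxScaler()
df[['col1', 'col2']] = scaler.fit_transform(df[['col1', 'col2']])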
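
Min-max scaling to the 0 to 1 range computes `(x - min) / (max - min)` for each column. The same arithmetic can be written directly in pandas, which makes the transformation easy to check by hand. The sample values below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [10, 20, 30], 'col2': [100, 400, 500]})

# Min-max scaling: (x - min) / (max - min), applied column by column
df_scaled = (df - df.min()) / (df.max() - df.min())
```

In this sketch a constant column would divide by zero and produce NaN values, so such columns need separate handling.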

#### Categorical encoding
Machine learning models generally require numerical input, so categorical variables are usually encoded as numbers. This applies to both ordinal variables (whose levels have an order) and nominal variables (whose levels do not).

In [None]:
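# Example DataFrame with one ordinal and one nominal column
df = pd.DataFrame({'ordinal_var': ['low', 'high', 'medium'],
                   'nominal_var': ['red', 'green', 'blue']})

# Ordinal encoding: map each level to its rank (values missing from the mapping become NaN)
df['ordinal_var'] = df['ordinal_var'].map({'low': 1, 'medium': 2, 'high': 3})

# One-hot encoding for nominal variables
df = pd.get_dummies(df, columns=['nominal_var'])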
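
As an alternative to a hand-written mapping, the order of an ordinal variable can be declared once with an ordered categorical dtype. This is a sketch under the assumption that the levels are `'low'`, `'medium'`, and `'high'`; the column name `ordinal_code` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'ordinal_var': ['low', 'high', 'medium', 'low']})

# Declare the order once; integer codes follow the position in `categories`
levels = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
df['ordinal_code'] = df['ordinal_var'].astype(levels).cat.codes
```

Note that the codes start at 0 rather than 1, and any value outside the declared categories receives the code -1.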

## Basic data visualization
#### Matplotlib
Let's start with Matplotlib, a widely used library for creating static, animated, and interactive visualizations in Python.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.title('Line Plot')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# Scatter plot
x = np.random.rand(50)
y = np.random.rand(50)
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# Bar plot
labels = ['A', 'B', 'C']
values = [10, 35, 50]
plt.bar(labels, values)
plt.title('Bar Plot')
plt.show()

#### Seaborn
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

In [None]:
import seaborn as sns
import pandas as pd

# Creating a sample DataFrame
df = pd.DataFrame({
    'A': np.random.rand(50),
    'B': np.random.rand(50),
    'C': ['X']*25 + ['Y']*25
})

# Box plot
sns.boxplot(x='C', y='A', data=df)
plt.title('Box Plot')
plt.show()

# Violin plot
sns.violinplot(x='C', y='B', data=df)
plt.title('Violin Plot')
plt.show()

# Pair plot (pairplot returns a grid of axes, so the title is set on the whole figure)
g = sns.pairplot(df)
g.figure.suptitle('Pair Plot', y=1.02)
plt.show()