# 1. Understanding the Dataset
Objective: Familiarize yourself with the dataset by loading it and inspecting its features.
Tasks:
Load a dataset (e.g., from Kaggle or any open-source platform).
Understand the target variable and independent variables.
Check data types, column names, and shape of the dataset.
Example: Use Pandas to load the dataset and check its structure:

In [2]:
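import pandas as pd

df = pd.read_csv('titanic.csv')
# info() prints its own summary and returns None, so print() adds a trailing "None"
print(df.info())
# Summary statistics for the numeric columns
print(df.describe())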


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15053 entries, 0 to 15052
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Rank                 15053 non-null  int64  
 1   TeamId               15053 non-null  int64  
 2   TeamName             15053 non-null  object 
 3   LastSubmissionDate   15053 non-null  object 
 4   Score                15053 non-null  float64
 5   SubmissionCount      15053 non-null  int64  
 6   TeamMemberUserNames  15053 non-null  object 
dtypes: float64(1), int64(3), object(3)
memory usage: 823.3+ KB
None
               Rank        TeamId         Score  SubmissionCount
count  15053.000000  1.505300e+04  15053.000000     15053.000000
mean    7526.000000  1.222986e+07      0.765119         3.621139
std     4345.571136  1.696860e+06      0.080605         6.319934
min        0.000000  3.689700e+04      0.000000         1.000000
25%     3763.000000  1.254842e+07      0.765550 
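
The tasks above also call for checking the shape, column names, and data types directly. A minimal sketch of those checks follows. It assumes a small made-up frame in place of the CSV so that it runs on its own; the column names mirror the output above, but the values are illustrative only.

```python
import pandas as pd

# Tiny stand-in for the loaded CSV; values are made up for illustration.
df = pd.DataFrame({
    "Rank": [1, 2, 3],
    "TeamName": ["a", "b", "c"],
    "Score": [0.81, 0.79, 0.77],
})

print(df.shape)             # (rows, columns)
print(df.columns.tolist())  # column names as a plain list
print(df.dtypes)            # one dtype per column
print(df.head())            # first rows, to eyeball actual values
```

With the real dataset, the same four calls apply unchanged to the `df` returned by `pd.read_csv`.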

# 2. Handling Missing Data
Objective: Identify and handle any missing values in the dataset.
Tasks:
Identify missing values using .isnull().sum().
Decide whether to remove, fill, or impute missing data.

In [3]:
print(df.isnull().sum())
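# Fill missing values in Score with the column mean (a no-op here: no values are missing).
# Assign the result back; calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update df.
df['Score'] = df['Score'].fillna(df['Score'].mean())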


Rank                   0
TeamId                 0
TeamName               0
LastSubmissionDate     0
Score                  0
SubmissionCount        0
TeamMemberUserNames    0
dtype: int64
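
Since this dataset has no missing values, the choice between removing and filling is easier to see on a toy example. The sketch below assumes a small made-up frame with gaps; the median fill and the `"unknown"` placeholder are illustrative choices, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps; the real leaderboard above has none.
df = pd.DataFrame({
    "Score": [0.80, np.nan, 0.70, 0.75],
    "TeamName": ["a", "b", None, "d"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill numeric gaps with the median, text gaps with a placeholder.
filled = df.copy()
filled["Score"] = filled["Score"].fillna(filled["Score"].median())
filled["TeamName"] = filled["TeamName"].fillna("unknown")

print(len(dropped))                  # rows left after dropping
print(filled.isnull().sum().sum())   # total missing values after filling
```

Dropping is simplest but discards rows; filling keeps every row at the cost of introducing values that were never observed.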


# 3. Data Visualization
Objective: Visualize the data to understand patterns, correlations, and outliers.
Tasks:
Use libraries like Matplotlib and Seaborn to plot distributions, correlations, and relationships.
Common plots: histograms, boxplots, pair plots, and correlation heatmaps.

In [3]:
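import seaborn as sns
import matplotlib.pyplot as plt

# Plot a histogram of Rank with a kernel density estimate overlaid
sns.histplot(df['Rank'], kde=True)
plt.show()

# Correlation heatmap; numeric_only=True skips the text columns,
# which would otherwise raise an error in pandas 2.0 and later
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()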


ModuleNotFoundError: No module named 'seaborn'

In [None]:
!pip install seaborn
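
The boxplots listed in the tasks can also be drawn with Matplotlib alone, which is useful while Seaborn is unavailable. A minimal sketch, assuming made-up submission counts with one obvious outlier; the column name is borrowed from the dataset above, but the values are not.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up submission counts; 40 is a deliberate outlier.
df = pd.DataFrame({"SubmissionCount": [1, 2, 2, 3, 3, 4, 40]})

fig, ax = plt.subplots()
# Points beyond 1.5 * IQR from the box are drawn as individual markers.
ax.boxplot(df["SubmissionCount"].to_numpy())
ax.set_ylabel("SubmissionCount")
ax.set_title("Submissions per team")
plt.show()
```

The outlier shows up as a single point far above the upper whisker, which is the pattern to look for in the real `SubmissionCount` column.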