<a href="https://colab.research.google.com/github/svganapathi/NM-Course/blob/main/Data_Python_for_Data_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Print a Simple Statement**

In [1]:
print("Hello, Excited to start Data Science.")

Hello, Excited to start Data Science.


## **Python Basics**

**Variable Assignment (Storing Customer Data)**

In [2]:
customer_name = "John Doe"
customer_age = 28
customer_balance = 120.75
print("Customer:", customer_name)
print("Age:", customer_age)
print("Balance:", customer_balance)

Customer: John Doe
Age: 28
Balance: 120.75


**Arithmetic Operations (Applying Discount)**

In [3]:
product_price = 200
discount = product_price * 0.10 # 10% discount
final_price = product_price - discount
print("Final Price after Discount:", final_price)

Final Price after Discount: 180.0


## **NumPy Basics**

**Create NumPy Arrays (Sales Data)**

In [4]:
import numpy as np
sales = np.array([150, 200, 250, 300, 400, 350, 500]) # Sales for each day
print("Sales Data:", sales)

Sales Data: [150 200 250 300 400 350 500]


**Statistical Analysis (Sales Performance)**

In [5]:
print("Average Sales:", np.mean(sales))
print("Highest Sale:", np.max(sales))
print("Lowest Sale:", np.min(sales))

Average Sales: 307.14285714285717
Highest Sale: 500
Lowest Sale: 150


## **Pandas Basics**

**Create a DataFrame (Customer Transactions)**

In [6]:
import pandas as pd
data = {
"Customer": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"Amount Spent": [120, 200, 150]
}
df = pd.DataFrame(data)
print(df)

  Customer  Age  Amount Spent
0    Alice   25           120
1      Bob   30           200
2  Charlie   35           150


**Load and View a CSV File**

In [None]:
from google.colab import files
uploaded = files.upload() # Upload your CSV file
df = pd.read_csv("customer_data.csv") # Replace with your file name
df.head()

## **Data Manipulation with Pandas**

**Filter High-Spending Customers**

In [None]:
high_spenders = df[df["Amount Spent"] > 150]
print(high_spenders)

**Sorting Customers by Spending**

In [None]:
df_sorted = df.sort_values(by="Amount Spent", ascending=False)
print(df_sorted)

**Add a New Column (Loyalty Points Calculation)**

In [None]:
df["Loyalty Points"] = df["Amount Spent"] // 10
print(df)
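
The three cells above run against whatever CSV was uploaded. As a minimal sketch that runs without an upload, the same filter, sort, and new-column steps are applied below to the small in-memory table from the "Create a DataFrame" cell. The name `df_demo` is an assumption introduced here so the uploaded `df` is left untouched.

```python
import pandas as pd

# Same three customers as the earlier in-memory example
df_demo = pd.DataFrame({
    "Customer": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Amount Spent": [120, 200, 150],
})

# Boolean mask: keep rows where spending is strictly greater than 150
high_spenders = df_demo[df_demo["Amount Spent"] > 150]
print(high_spenders)

# Sort from highest to lowest spender
df_sorted = df_demo.sort_values(by="Amount Spent", ascending=False)
print(df_sorted)

# Floor division: one loyalty point per full 10 units spent
df_demo["Loyalty Points"] = df_demo["Amount Spent"] // 10
print(df_demo)
```

Note that the filter uses `>`, so a customer who spent exactly 150 is not counted as a high spender.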

**Saving Processed Data**

In [None]:
df.to_csv("cleaned_customer_data.csv", index=False)
files.download("cleaned_customer_data.csv") # Download the file

---
---

## **Handling Missing Values & Duplicates**

In [None]:
# Check for missing values
print(df.isnull().sum())

**Remove Missing Values**

In [None]:
df_cleaned = df.dropna() # Removes rows with missing values
print(df_cleaned)

**Fill Missing Values (Imputation)**

`Fill with Mean/Median (For Numerical Data)`

In [None]:
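# Assumes df has "Age", "Marks", and "Attendance" columns.
# Assign the result back: recent pandas versions warn about inplace=True on a selected column.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Marks"] = df["Marks"].fillna(df["Marks"].median())
df["Attendance"] = df["Attendance"].fillna(df["Attendance"].mean())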

`Fill with Mode (For Categorical Data)`

In [None]:
# mode() returns a Series (there can be ties), so take the first value
df["Passed"] = df["Passed"].fillna(df["Passed"].mode()[0])

`Forward Fill & Backward Fill`

In [None]:
df.ffill(inplace=True) # Forward fill
df.bfill(inplace=True) # Backward fill

**Remove Duplicates**

In [None]:
df.drop_duplicates(inplace=True)
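
The cells in this section assume a student table with columns such as `Age`, `Marks`, and `Passed`. The sketch below builds a small hypothetical table of that shape (the names and values are invented for illustration) and walks through counting, filling, and de-duplicating in one pass.

```python
import numpy as np
import pandas as pd

# Hypothetical student records with gaps; not taken from any real dataset
students = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena", "Ravi", "Kiran"],
    "Age": [19, 21, np.nan, 21, 23],
    "Marks": [88.0, np.nan, 72.0, np.nan, 65.0],
    "Passed": ["Yes", "Yes", None, "Yes", "No"],
})

# Count missing values per column
missing_counts = students.isnull().sum()
print(missing_counts)

# Numerical columns: fill with mean or median; categorical column: fill with mode
students["Age"] = students["Age"].fillna(students["Age"].mean())
students["Marks"] = students["Marks"].fillna(students["Marks"].median())
students["Passed"] = students["Passed"].fillna(students["Passed"].mode()[0])

# Drop rows that exactly repeat an earlier row, then renumber the index
students = students.drop_duplicates().reset_index(drop=True)
print(students)
```

In this toy table the two "Ravi" rows become identical only after imputation, so the order of the steps changes how many rows survive.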

## **Data Transformation: Scaling & Encoding**

`Feature Scaling`

**Standardization (Z-score Normalization)**

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[["Marks", "Attendance"]] = scaler.fit_transform(df[["Marks", "Attendance"]])
print(df_scaled)

**Min-Max Scaling**

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled[["Marks", "Attendance"]] = scaler.fit_transform(df[["Marks", "Attendance"]])
print(df_scaled)
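
To make the two scalers less of a black box, the sketch below recomputes both transformations by hand on a hypothetical `Marks` column and checks them against scikit-learn. The sample values are an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical marks, chosen only for illustration
marks = pd.DataFrame({"Marks": [50.0, 60.0, 70.0, 80.0, 90.0]})
col = marks["Marks"]

# Standardization: (x - mean) / std, with the population std (ddof=0) that StandardScaler uses
z_manual = (col - col.mean()) / col.std(ddof=0)
z_sklearn = StandardScaler().fit_transform(marks[["Marks"]]).ravel()

# Min-max scaling: (x - min) / (max - min), which maps the column into [0, 1]
mm_manual = (col - col.min()) / (col.max() - col.min())
mm_sklearn = MinMaxScaler().fit_transform(marks[["Marks"]]).ravel()

print(np.allclose(z_manual, z_sklearn))
print(np.allclose(mm_manual, mm_sklearn))
```

pandas' own `std()` defaults to `ddof=1`, so omitting `ddof=0` in the manual version would give slightly different z-scores from `StandardScaler`.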

`Encoding Categorical Variables`

**One-Hot Encoding**

In [None]:
df_encoded = pd.get_dummies(df, columns=["Passed"], drop_first=True)
print(df_encoded)

**Label Encoding**

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df["Passed"] = encoder.fit_transform(df["Passed"])
print(df)

## **Feature Engineering**

**Deriving a new feature: Performance Category**

In [None]:
def performance_category(marks):
    # Map a numeric mark to a coarse performance label
    if marks >= 85:
        return "High"
    elif marks >= 70:
        return "Medium"
    else:
        return "Low"
df["Performance"] = df["Marks"].apply(performance_category)
print(df)

**Binning (Converting Continuous to Categorical Data)**

In [None]:
df["Age_Group"] = pd.cut(df["Age"], bins=[18, 21, 24], labels=["Young", "Adult"])
print(df)
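
One detail of `pd.cut` that is easy to miss: by default each bin excludes its left edge and includes its right edge, and values outside every bin become `NaN`. The sketch below shows this on a few hypothetical ages using the same bins and labels as the cell above.

```python
import pandas as pd

ages = pd.Series([18, 19, 21, 22, 24, 30])  # hypothetical ages

# Default bins are (18, 21] and (21, 24]: 18 and 30 fall outside and become NaN
groups = pd.cut(ages, bins=[18, 21, 24], labels=["Young", "Adult"])
print(groups.tolist())

# include_lowest=True makes the first bin include its left edge, so 18 is "Young"
groups_inclusive = pd.cut(
    ages, bins=[18, 21, 24], labels=["Young", "Adult"], include_lowest=True
)
print(groups_inclusive.tolist())
```

If the data contains ages above 24, widening the last bin edge is one way to avoid unlabelled rows.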

---
---

# **Data Visualization**

### **Matplotlib**

**Basic Matplotlib Plot**

`Line Plot`

In [None]:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y, label="Sine Wave")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.title("Simple Line Plot")
plt.legend()
plt.show()

`Bar Chart`

In [None]:
categories = ['A', 'B', 'C', 'D']
values = [10, 25, 15, 30]
plt.figure(figsize=(6, 4))
plt.bar(categories, values, color='purple')
plt.xlabel("Categories")
plt.ylabel("Values")
plt.title("Bar Chart Example")
plt.show()

`Histogram (Distribution of Data)`

In [None]:
data = np.random.randn(1000)
plt.figure(figsize=(7, 5))
plt.hist(data, bins=30, color='green', edgecolor='black', alpha=0.7)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram Example")
plt.show()

**Seaborn - Statistical Data Visualization**

`Histogram & KDE Plot`

In [None]:
import seaborn as sns
import pandas as pd

# Creating sample data
data = np.random.randn(1000)

df = pd.DataFrame(data, columns=['Values'])

# Plot
sns.histplot(df['Values'], bins=30, kde=True, color='blue')
plt.title("Histogram with KDE")
plt.show()

`Box Plot (Detecting Outliers)`

In [None]:
tips = sns.load_dataset('tips')
plt.figure(figsize=(6, 4))
sns.boxplot(x=tips['total_bill'])
plt.title("Box Plot of Total Bill")
plt.show()

`Pair Plot (Exploring Relationships)`

In [None]:
sns.pairplot(tips, hue='sex')
plt.show()

`Heatmap (Correlation Analysis)`

In [None]:
# One-hot encode the categorical columns into a separate DataFrame,
# so the original `tips` (with its 'sex' column) stays available for later plots
tips_encoded = pd.get_dummies(tips, columns=['sex', 'smoker', 'day', 'time'])

# Now calculate the correlation matrix
corr_matrix = tips_encoded.corr()

plt.figure(figsize=(12, 10))  # Adjust figure size for better readability
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

**Plotly**

`Interactive Line Plot`

In [None]:
import plotly.express as px
df = pd.DataFrame({
"x": np.linspace(0, 10, 100),
"y": np.sin(np.linspace(0, 10, 100))
})
fig = px.line(df, x='x', y='y', title="Interactive Sine Wave")
fig.show()

`Interactive Scatter Plot`

In [None]:
fig = px.scatter(tips, x='total_bill', y='tip', color='sex', size='size', title="Total Bill vs Tip")
fig.show()

`Interactive 3D Scatter Plot`

In [None]:
import plotly.graph_objects as go
fig = go.Figure(data=[go.Scatter3d(
x=tips['total_bill'],
y=tips['tip'],
z=tips['size'],
mode='markers',
marker=dict(size=5, color=tips['total_bill'], colorscale='Viridis'))])
fig.update_layout(title="3D Scatter Plot of Total Bill, Tip & Size")
fig.show()

---
---

# **Exploratory Data Analysis (EDA)**

## **Measures of Dispersion (Variance, Standard Deviation, IQR)**

**Practical Example: Predicting Customer Spending**

`Step 1: Calculate the Mean`

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
# Sample dataset
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# Mean
mean_value = np.mean(data)
print(f"Mean: {mean_value}")

`Step 2: Compute Variance and Standard Deviation`

In [None]:
# Variance
variance = np.var(data, ddof=1) # Sample variance
# Standard Deviation
std_dev = np.std(data, ddof=1)
print(f"Variance: {variance}, Standard Deviation: {std_dev}")
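
The interquartile range (IQR) is the third measure of dispersion covered here. A minimal sketch of computing it on the same sample list, using `np.percentile` with its default linear interpolation:

```python
import numpy as np

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # same sample as above

# Q1 and Q3 are the 25th and 75th percentiles; IQR is the spread of the middle 50%
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
```

Unlike variance and standard deviation, the IQR ignores the extremes of the data, which is why it is used for outlier detection.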

## **Correlation Analysis:**

**What is a Heatmap?**

`A heatmap is a colorful visualization that shows the strength of relationships between multiple variables in a dataset.`

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Sample Data: Student Scores in Different Subjects
data = {
'Math': [80, 85, 78, 90, 88, 92, 76, 89, 95, 84],
'Science': [75, 82, 79, 91, 87, 95, 72, 88, 97, 83],
'English': [85, 80, 78, 88, 90, 85, 76, 89, 92, 81]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Generate Correlation Matrix
correlation_matrix = df.corr()
# Plot Heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Student Scores Correlation Heatmap")
plt.show()
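
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a sketch, the Math and Science value can be reproduced directly from the formula:

```python
import numpy as np
import pandas as pd

# Two of the subject columns from the example above
scores = pd.DataFrame({
    "Math": [80, 85, 78, 90, 88, 92, 76, 89, 95, 84],
    "Science": [75, 82, 79, 91, 87, 95, 72, 88, 97, 83],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx = scores["Math"] - scores["Math"].mean()
dy = scores["Science"] - scores["Science"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = scores["Math"].corr(scores["Science"])
print(round(r_manual, 4), round(r_pandas, 4))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.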

## **Identifying Outliers**

**Box Plots and IQR Method**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
data = [10, 12, 14, 15, 17, 20, 30, 100] # 100 is an outlier
# Convert to DataFrame
df = pd.DataFrame(data, columns=['values'])
# Calculate Q1, Q3, and IQR
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1
# Define outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['values'] < lower_bound) | (df['values'] > upper_bound)]
# Plot Boxplot
plt.boxplot(df['values'])
plt.title("Box Plot to Detect Outliers")
plt.show()
# Print outliers
print("Outliers:\n", outliers)

**Z-Score Method**

In [None]:
from scipy import stats
# Convert to NumPy array
data_array = np.array(data)
# Calculate Z-scores
z_scores = np.abs(stats.zscore(data_array))
# Find outliers (Z-score > 2)
outliers = data_array[z_scores > 2]
print("Outliers using Z-score method:", outliers)
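
As a sketch of what `stats.zscore` does, the block below recomputes the Z-scores by hand on the same values and compares two thresholds. The variable name `values` and the stricter threshold of 3 are additions for illustration.

```python
import numpy as np
from scipy import stats

values = np.array([10, 12, 14, 15, 17, 20, 30, 100])

# z = (x - mean) / std; np.std and stats.zscore both default to the population std (ddof=0)
z_manual = (values - values.mean()) / values.std()
z_scipy = stats.zscore(values)
print(np.allclose(z_manual, z_scipy))

# Threshold of 2, as in the cell above
outliers_2 = values[np.abs(z_manual) > 2]
print(outliers_2)

# A stricter threshold of 3 flags nothing in a sample this small
outliers_3 = values[np.abs(z_manual) > 3]
print(outliers_3)
```

The outlier itself inflates the mean and standard deviation, which is why the Z-score method can be less sensitive than the IQR method on small samples.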

**Visualizing Outliers with Scatter Plots**

In [None]:
plt.scatter(range(len(data)), data, color='blue', label="Data Points")
plt.scatter([data.index(100)], [100], color='red', label="Outlier") # Highlight outlier
plt.xlabel("Index")
plt.ylabel("Values")
plt.title("Scatter Plot Showing Outlier")
plt.legend()
plt.show()

# **Categorical Data Analysis**

**Value Counts and Frequency Tables**

In [None]:
import pandas as pd
# Sample categorical data
data = pd.DataFrame({'Category': ['Apple', 'Banana', 'Apple', 'Orange', 'Banana', 'Apple']})
# Count occurrences
print(data['Category'].value_counts())
# Frequency table (percentage)
print(data['Category'].value_counts(normalize=True) * 100)

**Bar Plots and Count Plots**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Sample data
categories = ['Apple', 'Banana', 'Apple', 'Orange', 'Banana', 'Apple']
df = pd.DataFrame({'Category': categories})
# Bar Plot
df['Category'].value_counts().plot(kind='bar', color=['red', 'yellow', 'orange'])
plt.xlabel("Category")
plt.ylabel("Count")
plt.title("Bar Plot of Categorical Data")
plt.show()
# Count Plot (Using Seaborn)
sns.countplot(x=df['Category'], palette="pastel")
plt.title("Count Plot of Categorical Data")
plt.show()

**Encoding Categorical Variables**

In [None]:
from sklearn.preprocessing import LabelEncoder  # one-hot encoding below uses pd.get_dummies
# Sample data
df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana']})
# Label Encoding
label_encoder = LabelEncoder()
df['Fruit_Label'] = label_encoder.fit_transform(df['Fruit'])
print(df)
# One-Hot Encoding
df_one_hot = pd.get_dummies(df['Fruit'])
print(df_one_hot)

## **Automated EDA Reports**

**Using ydata-profiling (formerly pandas-profiling) for Quick Insights**

In [None]:
!pip install ydata-profiling
import pandas as pd
from ydata_profiling import ProfileReport # Correct import
# Load sample dataset
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")
# Generate profile report
profile = ProfileReport(df, explorative=True)
# Display report in Colab
profile.to_notebook_iframe()
profile.to_file("titanic_report.html")

**Generating Sweetviz Reports**

In [None]:
!pip install sweetviz
import sweetviz as sv
# Generate a report
report = sv.analyze(df)
# Show the report in a browser
report.show_html("titanic_sweetviz.html")

**Exploring Data with dtale**

In [None]:
!pip install dtale
import dtale
# Launch D-Tale dashboard
dtale.show(df)