# Unlocking Insights of Data Science using Projects

# 1. Wine Review Analysis

This project analyzes wine data. Only the most important steps are outlined here; the complete modelling project is available at https://github.com/sukanya789/ML_practice_project/blob/main/redwine_quality_analysis.ipynb. The workflow starts with setting up a data science environment, then moves on to data manipulation and preprocessing with pandas, descriptive analysis and matrix calculations with NumPy, and data visualization with Matplotlib.

a. Set Up the Data Science Environment

* Using Terminal:

In [None]:
# Navigating to the directory where we want to store notebook files
cd /path/to/your/project/directory

# Create a virtual environment
python -m venv dsenv

# Activate the virtual environment (on Windows: dsenv\Scripts\activate)
source dsenv/bin/activate

# Install necessary libraries
pip install pandas numpy matplotlib scikit-learn jupyterlab

# Start JupyterLab (the jupyterlab package installed above provides this command)
jupyter lab

# Deactivate the environment once the work is done
deactivate

* Using a notebook environment such as Jupyter Notebook or Google Colab:

In [None]:
import subprocess
import sys

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])
    
# Install virtualenv if it is not already available
try:
    import virtualenv
except ImportError:
    install('virtualenv')

# Install the necessary libraries into the interpreter that runs this notebook
libraries = ['pandas', 'numpy', 'matplotlib']
for lib in libraries:
    install(lib)

# Create a virtual environment named 'dsenv' in the current directory
subprocess.run([sys.executable, '-m', 'virtualenv', 'dsenv'], check=True)

# Activate it inside the running interpreter via virtualenv's activate_this.py
if sys.platform.startswith('win'):
    activate_script = 'dsenv\\Scripts\\activate_this.py'
else:
    activate_script = 'dsenv/bin/activate_this.py'

with open(activate_script, "r") as file:
    exec(file.read(), dict(__file__=activate_script))
        
print("Data Science Environment Setup Complete.")       
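
After the setup cell has run, it can help to confirm which libraries the current kernel can actually import. The following is a minimal sketch that uses only the standard library; the helper name check_libraries is hypothetical and not part of the original project.

```python
import importlib.util
import sys


def check_libraries(names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Module names as they are imported (scikit-learn is imported as "sklearn")
status = check_libraries(["pandas", "numpy", "matplotlib", "sklearn"])

print("Python executable:", sys.executable)
for name, available in status.items():
    print(f"{name}: {'available' if available else 'missing'}")
```

The printed executable path shows which interpreter the notebook is using, which is a quick way to see whether the kernel is running from the intended environment.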

b. Data Manipulation with Pandas

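The cell below sketches the basic exploration and preprocessing calls used at this stage. Names such as 'column_name', value, df1 and df2 are placeholders to replace with your own columns and data.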

In [None]:
import pandas as pd

# Load the dataset (replace '' with the path to your saved dataset file)
df = pd.read_csv('')

# Basic data exploration
# ('column_name', 'grouping_column' and value are placeholders for your own data)
print(df.head())  # Print the first few rows of the DataFrame
print(df.dtypes)  # Print the data type of each column
print(df.describe())  # Generate summary statistics
print(df.isnull().sum())  # Count missing values in each column
print(df.loc[df['column_name'] > value])  # Filter rows based on a condition
print(df.sort_values(by='column_name', ascending=False))  # Sort by a column, largest first
print(df.groupby('grouping_column').mean(numeric_only=True))  # Group rows and average the numeric columns

# Data preprocessing (pick the steps your dataset needs)
df = df.dropna()  # Drop rows that contain missing values
df = df.drop_duplicates()  # Remove duplicate rows
merged_df = pd.merge(df1, df2, on='key_column')  # Merge two DataFrames (df1, df2 are placeholders)
df = df.fillna(value)  # Alternative to dropna(): fill missing values with a chosen value
df['new_column'] = df['column_one'] * df['column_two']  # Perform element-wise multiplication to create a new column
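
Because the calls above rely on placeholders, a tiny hand-made table may make the difference between dropping and filling missing values more concrete. This is only a sketch: the column names mimic a wine table, but every value is invented for illustration and does not come from the project's dataset.

```python
import numpy as np
import pandas as pd

# Invented toy data; the values are made up for illustration only
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7],
    "alcohol": [9.4, np.nan, 10.5, 11.0, 12.0],
    "pH": [3.5, 3.2, 3.3, 3.3, 3.1],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: keep all rows and fill the gap with the column median
filled = toy.fillna({"alcohol": toy["alcohol"].median()})

# Mean alcohol per quality level, computed on the filled data
mean_alcohol = filled.groupby("quality")["alcohol"].mean()

print(len(toy), len(dropped), len(filled))
print(mean_alcohol)
```

Dropping is simpler but discards the whole row, including its valid pH and quality values; filling keeps the row at the cost of introducing an estimated value.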

c. Descriptive Analysis and Matrix Calculation with NumPy (as required by the dataset)

In [None]:
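import numpy as np

# Numeric columns of the DataFrame as a 2-D array (rows = samples, columns = features)
data = df.select_dtypes(include='number').to_numpy()
X = data  # the same array, used as the feature matrix below

# Descriptive Analysis
print("Descriptive Analysis:")
print("Mean:", np.mean(data, axis=0))
print("Median:", np.median(data, axis=0))
print("Standard Deviation:", np.std(data, axis=0))

# For matrix calculations

# Calculate the covariance matrix (np.cov expects one variable per row, hence X.T)
cov_matrix = np.cov(X.T)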
print("\nCovariance Matrix:")
print(cov_matrix)

# Calculate the correlation matrix
corr_matrix = np.corrcoef(X.T)
print("\nCorrelation Matrix:")
print(corr_matrix)
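
The covariance and correlation matrices are closely related: each correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on a small invented matrix (the numbers are not from the wine data). One detail worth noting is that np.cov divides by N - 1 by default, while np.std divides by N unless ddof=1 is passed.

```python
import numpy as np

# Invented 5x3 feature matrix: rows are observations, columns are features
X = np.array([
    [1.0, 2.0, 10.0],
    [2.0, 4.1, 8.0],
    [3.0, 6.2, 7.5],
    [4.0, 7.9, 5.0],
    [5.0, 10.1, 3.0],
])

cov_matrix = np.cov(X, rowvar=False)        # same result as np.cov(X.T)
corr_matrix = np.corrcoef(X, rowvar=False)  # same result as np.corrcoef(X.T)

# np.cov divides by N - 1, so use the sample standard deviation (ddof=1) here;
# np.std defaults to ddof=0 (division by N)
sample_std = np.std(X, axis=0, ddof=1)
corr_from_cov = cov_matrix / np.outer(sample_std, sample_std)

print(np.allclose(corr_from_cov, corr_matrix))
```

Passing rowvar=False tells NumPy that columns are the variables, which avoids the transpose used in the cell above.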

d. Data Visualization with Matplotlib

In [None]:
import matplotlib.pyplot as plt

# Plot Graphs and maps for visualization

# Histogram of a feature (e.g., 'alcohol')
plt.figure(figsize=(10,6))
plt.hist(df['alcohol'], bins=30, color='purple', alpha=0.7)
plt.title('Distribution of Alcohol in % Vol')
plt.xlabel('Alcohol in % Vol')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Scatter plot of two features (e.g., 'alcohol' and 'quality')
plt.figure(figsize=(10,6))
plt.scatter(df['alcohol'], df['quality'], color='blue', alpha=0.7)
plt.title('Alcohol vs Quality')
plt.xlabel('Alcohol in % Vol')
plt.ylabel('Quality')
plt.grid(True)
plt.show()

# Box plot
plt.figure(figsize=(10,6))
plt.boxplot(df['alcohol'], notch=True, vert=False)
plt.title('Box Plot of Alcohol in % Vol')
plt.xlabel('Alcohol in % Vol')
plt.grid(True)
plt.show()

# Bar plot, let's assume 'quality' is a categorical variable
quality_counts = df['quality'].value_counts()

plt.figure(figsize=(10,6))
plt.bar(quality_counts.index, quality_counts.values, color='green', alpha=0.7)
plt.title('Bar Plot of Wine Quality')
plt.xlabel('Quality')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Pie chart
quality_counts = df['quality'].value_counts()

plt.figure(figsize=(10,6))
plt.pie(quality_counts.values, labels=quality_counts.index, autopct='%1.1f%%')
plt.title('Pie Chart of Wine Quality')
plt.show()
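
The plots above each open their own figure. When several views of the same data belong together, they can share one figure through plt.subplots. The sketch below assumes nothing about the real dataset: the helper name plot_overview is hypothetical, and the alcohol and quality values are randomly generated stand-ins for df['alcohol'] and df['quality'].

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_overview(alcohol, quality):
    """Draw a histogram of alcohol and a bar chart of quality counts side by side."""
    fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(12, 5))

    # Left panel: distribution of a continuous feature
    ax_hist.hist(alcohol, bins=30, color="purple", alpha=0.7)
    ax_hist.set_title("Distribution of Alcohol in % Vol")
    ax_hist.set_xlabel("Alcohol in % Vol")
    ax_hist.set_ylabel("Frequency")

    # Right panel: how often each quality level occurs
    levels, counts = np.unique(quality, return_counts=True)
    ax_bar.bar(levels, counts, color="green", alpha=0.7)
    ax_bar.set_title("Bar Plot of Wine Quality")
    ax_bar.set_xlabel("Quality")
    ax_bar.set_ylabel("Frequency")

    fig.tight_layout()
    return fig, (ax_hist, ax_bar)


# Synthetic stand-ins for the real columns; the numbers are randomly generated
rng = np.random.default_rng(0)
fake_alcohol = rng.normal(loc=10.5, scale=1.0, size=200)
fake_quality = rng.integers(3, 9, size=200)

fig, (ax_hist, ax_bar) = plot_overview(fake_alcohol, fake_quality)
```

In a notebook the figure is displayed when the cell finishes; in a script, call plt.show() or fig.savefig(...) afterwards.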


# 2. Housing Price Prediction

The aim of this project is to predict housing prices from various features. It shows how data science applies to real estate markets and how different factors influence the price of a house.
The steps covered here are data preprocessing, descriptive analysis, data visualization, feature engineering, training a model to predict housing prices, and evaluating the model's performance.

a. Data Manipulation / Preprocessing with Pandas

This step is almost the same for every machine learning (data science) project.

In [None]:
import pandas as pd

# Load the dataset (replace '' with the path to your saved dataset file)
df = pd.read_csv('')

# Basic data exploration
# ('column_name', 'grouping_column' and value are placeholders for your own data)
print(df.head())
print(df.dtypes)
print(df.describe())
print(df.isnull().sum())
print(df.loc[df['column_name'] > value])
print(df.sort_values(by='column_name', ascending=False))
print(df.groupby('grouping_column').mean(numeric_only=True))

# Data preprocessing (pick the steps your dataset needs)
df = df.dropna()  # drop rows that contain missing values
df = df.drop_duplicates()
merged_df = pd.merge(df1, df2, on='key_column')  # df1 and df2 are placeholders
df = df.fillna(value)  # alternative to dropna()
df['new_column'] = df['column_one'] * df['column_two']

b. Descriptive analysis and matrix calculation with NumPy

In [None]:
import numpy as np

# Descriptive Analysis
print("Descriptive Analysis:")
print(df.describe())

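# Matrix Calculations
numeric_df = df.select_dtypes(include='number')  # np.cov and np.corrcoef need numeric columns only
X = numeric_df.drop('price', axis=1).values  # feature matrix
y = df['price'].values  # target vector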

# Covariance and Correlation matrices
print("\nCovariance Matrix:")
print(np.cov(X.T))

print("\nCorrelation Matrix:")
print(np.corrcoef(X.T))
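
A full correlation matrix can be hard to read when the real question is which features move with the price. One possible shortcut is to correlate every feature with the target and sort by strength. The sketch below uses a synthetic table: the feature names area, rooms and age, the coefficients, and the noise levels are all assumptions made up for illustration, not properties of the project's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic housing table; feature names and coefficients are invented
rng = np.random.default_rng(42)
n = 200
area = rng.uniform(50, 250, size=n)
rooms = np.round(area / 40 + rng.normal(0, 1.0, size=n))
age = rng.uniform(0, 60, size=n)
price = 1000 * area - 2000 * age + rng.normal(0, 20000, size=n)

houses = pd.DataFrame({"area": area, "rooms": rooms, "age": age, "price": price})

# Pearson correlation of every feature with the target
features = houses.drop(columns="price")
corr_with_price = features.corrwith(houses["price"])

# Order by absolute strength so strong negative relationships are not buried
ranking = corr_with_price.sort_values(key=np.abs, ascending=False)

print(ranking)
```

Correlation only captures linear relationships, so a feature with a low value here may still be useful to a non-linear model.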

c. Data Visualization with Matplotlib & Seaborn

Next, we explore two important Python libraries, Matplotlib and Seaborn, that are widely used for data visualization.
Matplotlib - It is one of the most popular data visualization libraries in Python. It is a low-level library with a MATLAB-like interface, which offers a lot of freedom at the cost of having to write more code. It is an excellent choice for creating simple plots, multi-plot figures, and many other kinds of visualizations.

Seaborn - It provides a high-level interface to Matplotlib. It needs less syntax and comes with attractive default themes and a rich collection of visualizations, including more complex types such as time series, violin plots, and joint plots. Seaborn works well with Pandas DataFrames, making it easier to explore your data and produce attractive, informative statistical graphics.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Pairplot for visualizing relationships between features
sns.pairplot(df)
plt.show()

# Correlation heatmap (numeric columns only)
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Distribution plot (histplot replaces the deprecated distplot)
sns.histplot(df['price'], bins=30, kde=True)
plt.title('Distribution of Prices')
plt.show()

# Box plot (categorical column)
plt.figure(figsize=(10,6))
sns.boxplot(x='location', y='price', data=df)
plt.title('Box Plot of Prices for each Location')
plt.xticks(rotation=90)
plt.show()

# Scatter plot
plt.figure(figsize=(10,6))
sns.scatterplot(x='area', y='price', data=df)
plt.title('Scatter Plot of Area vs Price')
plt.show()

# These are only a few of the possible visualization steps

d. Feature Engineering with Scikit-learn

Feature engineering is the process of transforming raw data into features that better represent the problem to the models, enhancing their performance. Scikit-learn, a Python library for machine learning, offers several tools for feature engineering:

Preprocessing: Standardizes or normalizes features.

Feature Extraction: Extracts features from text and images.

Feature Selection: Selects the most informative features.

Dimensionality Reduction: Reduces the number of variables to consider.

In this section, we will use Scikit-learn's functionality to prepare our data for machine learning models.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling: standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Dimensionality reduction: use Principal Component Analysis (PCA) to keep two components
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
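
The cell above fixes the number of principal components at two. An alternative worth considering is to let the explained variance decide how many components to keep. The following is a sketch on synthetic data (six columns generated from two hidden factors plus noise, all invented here); the 0.95 threshold is an arbitrary choice for illustration, not a recommendation from the original project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 6 columns driven by 2 underlying factors plus noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X_demo = factors @ mixing + 0.05 * rng.normal(size=(300, 6))

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components between 0 and 1 asks PCA to keep enough components
# to explain at least that fraction of the variance; the scikit-learn docs
# describe this mode for svd_solver="full", so it is set explicitly
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", np.cumsum(pca.explained_variance_ratio_))
```

As in the cell above, the scaler and PCA should be fitted on the training split only and then applied to the test split with transform.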

e. Regression algorithms model training with Scikit-learn and Pickle

Training Regression Models with Scikit-learn and Pickle

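Regression algorithms predict continuous outcomes. With Scikit-learn, a Python library, we can train these models efficiently. After training, we can save the models using Pickle, a Python module for object serialization, so they can be reused in later sessions. A pickled model should be loaded with the same Scikit-learn version that created it, and pickle files should only be loaded from trusted sources.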

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pickle

# Model Training
model = LinearRegression()
model.fit(X_train, y_train)

# Save the model
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
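
Saving is only half of the workflow; the model also has to be loaded again before it can be reused. The sketch below shows a full round trip on synthetic data (the coefficients 3, -2 and 5 are invented) and writes to a temporary directory so that it does not overwrite any existing model.pkl.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data: y = 3*x0 - 2*x1 + 5 plus a little noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5 + 0.1 * rng.normal(size=100)

model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Load it back, as a later session would
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after round trip:", same)
```

If preprocessing objects such as a scaler or PCA were fitted before training, they need to be saved alongside the model, since new data must pass through the same transformations.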

f. Evaluation of machine learning model

Evaluating a machine learning model is as crucial as training it. Model evaluation involves assessing the performance of the model in predicting the outcomes of unseen data. It helps us understand the robustness of the model, its generalization ability, and how well it has captured the underlying patterns in the data. Common evaluation metrics include accuracy, precision, recall, F1-score for classification tasks, and mean absolute error, mean squared error, and R-squared for regression tasks. 

In [None]:
# Model Evaluation
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print("Mean Squared Error:", mse)
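
The cell above reports only the mean squared error, while the text also mentions mean absolute error and R-squared. To show what these regression metrics measure, the sketch below computes them directly with NumPy. The helper name regression_metrics and the four price values are invented for illustration; the formulas follow the standard definitions that scikit-learn's mean_absolute_error, mean_squared_error and r2_score also implement.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))   # average size of the error
    mse = np.mean(errors ** 2)      # penalizes large errors more strongly
    rmse = np.sqrt(mse)             # same units as the target
    ss_res = np.sum(errors ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Invented example values
y_true_demo = [200.0, 250.0, 300.0, 350.0]
y_pred_demo = [210.0, 240.0, 310.0, 330.0]

metrics = regression_metrics(y_true_demo, y_pred_demo)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

RMSE is often easier to interpret than MSE because it is expressed in the same units as the price, and R-squared shows how much of the variation in the target the model explains compared with always predicting the mean.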