<a href="https://colab.research.google.com/github/sanjanb/Machine-Learning-basics/blob/main/Heart_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Basics of Machine Learning**

Machine Learning (ML) is a field of artificial intelligence that focuses on building models that can learn from and make predictions on data. Here are some fundamental concepts:

### 1. **Types of Machine Learning**
a. **Supervised Learning**: The model is trained on labeled data.

*   Classification: Predicting a discrete label (e.g., presence or absence of heart disease).
*   Regression: Predicting a continuous value (e.g., predicting blood pressure).

b. **Unsupervised Learning**: The model is trained on unlabeled data.

*   Clustering: Grouping similar data points together.
*   Dimensionality Reduction: Reducing the number of features while retaining important information.

c. **Reinforcement Learning**: The model learns by interacting with an environment and receiving rewards or penalties.

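
The difference between these families is easier to see on a toy problem. The sketch below is illustrative only: the synthetic data, the variable names, and the choice of `LogisticRegression`, `LinearRegression`, and `KMeans` are assumptions made for this example, and reinforcement learning is left out because it needs an environment to interact with.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Two well-separated groups of points along one feature (synthetic data).
X = np.concatenate([rng.normal(0.0, 0.5, 50), rng.normal(5.0, 0.5, 50)]).reshape(-1, 1)

# Supervised classification: discrete labels are provided.
y_class = np.array([0] * 50 + [1] * 50)
clf = LogisticRegression().fit(X, y_class)
class_pred = clf.predict([[0.0], [5.0]])

# Supervised regression: continuous targets are provided.
y_reg = 2.0 * X.ravel() + 1.0
reg = LinearRegression().fit(X, y_reg)
reg_pred = reg.predict([[3.0]])

# Unsupervised clustering: no labels, the model groups the points itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print("classification:", class_pred)
print("regression:", reg_pred)
print("clusters found:", np.unique(cluster_ids))
```

The classifier and the regressor both need a target array when fitting, while `KMeans` is fitted on the features alone.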
### 2. **Machine Learning Workflow**
1.  **Data Collection**: Gather and prepare the dataset.
2.  **Data Preprocessing**: Clean and preprocess the data.
3.  **Exploratory Data Analysis (EDA)**: Understand the data through visualization and statistics.
4.  **Model Selection**: Choose an appropriate machine learning algorithm.
5.  **Model Training**: Train the model on the training data.
6.  **Model Evaluation**: Evaluate the model's performance on the test data.
7.  **Model Deployment**: Deploy the model for real-world use.

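
Most of these steps map onto a few lines of scikit-learn. The following is a minimal sketch, assuming the built-in Iris dataset as a stand-in for collected data and a scaler plus logistic regression as the chosen model. EDA and deployment are omitted, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data collection: a small built-in dataset stands in for a real one.
X, y = load_iris(return_X_y=True)

# Hold out a test set before any fitting happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model selection: scaling and the classifier in one pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# Training.
model.fit(X_train, y_train)

# Evaluation on data the model has not seen.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("test accuracy:", test_accuracy)
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training split only, which keeps test data out of the fitting step.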
### 3. **Key Libraries**
Pandas : For data manipulation and analysis.

NumPy : For numerical operations.

Scikit-learn : For machine learning algorithms.

Matplotlib/Seaborn : For data visualization.

KaggleHub : For downloading datasets from Kaggle.

In [None]:
import pandas as pd
import os

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("johnsmith88/heart-disease-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/johnsmith88/heart-disease-dataset?dataset_version_number=2...


100%|██████████| 6.18k/6.18k [00:00<00:00, 1.76MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/johnsmith88/heart-disease-dataset/versions/2





In [None]:
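# `path` is a directory, so find the CSV inside it (assumes a single CSV file).
csv_files = [f for f in os.listdir(path) if f.endswith(".csv")]
df = pd.read_csv(os.path.join(path, csv_files[0]))
print(df)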

      age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0      52    1   0       125   212    0        1      168      0      1.0   
1      53    1   0       140   203    1        0      155      1      3.1   
2      70    1   0       145   174    0        1      125      1      2.6   
3      61    1   0       148   203    0        1      161      0      0.0   
4      62    0   0       138   294    1        1      106      0      1.9   
...   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
1020   59    1   1       140   221    0        1      164      1      0.0   
1021   60    1   0       125   258    0        0      141      1      2.8   
1022   47    1   0       110   275    0        0      118      1      1.0   
1023   50    0   0       110   254    0        0      159      0      0.0   
1024   54    1   0       120   188    0        1      113      0      1.4   

      slope  ca  thal  target  
0         2   2     3       0  
1         0

In [None]:
df.head()
df.tail()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


In [None]:
df.describe()

|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean | 54.434146 | 0.69561 | 0.942439 | 131.611707 | 246.0 | 0.149268 | 0.529756 | 149.114146 | 0.336585 | 1.071512 | 1.385366 | 0.754146 | 2.323902 | 0.513171 |
| std | 9.07229 | 0.460373 | 1.029641 | 17.516718 | 51.59251 | 0.356527 | 0.527878 | 23.005724 | 0.472772 | 1.175053 | 0.617755 | 1.030798 | 0.62066 | 0.50007 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


In [None]:
df.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0


In [None]:
df.nunique()

age          41
sex           2
cp            4
trestbps     49
chol        152
fbs           2
restecg       3
thalach      91
exang         2
oldpeak      40


In [None]:
df.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [None]:
df.shape

(1025, 14)

In [None]:
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...
1020     True
1021     True
1022     True
1023     True
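
A row-by-row boolean Series is hard to read at this size, so it usually helps to count the flagged rows and, if appropriate, drop them. The sketch below uses a hypothetical three-row frame rather than the heart-disease data. Whether repeated rows should be dropped at all depends on whether they are true duplicates or distinct patients with identical measurements, which the output above cannot tell us.

```python
import pandas as pd

# Hypothetical toy frame with one repeated row (not the heart-disease data).
toy = pd.DataFrame({"age": [52, 53, 52], "chol": [212, 203, 212]})

dup_mask = toy.duplicated()          # True for rows that repeat an earlier row
n_duplicates = int(dup_mask.sum())   # number of repeated rows
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

By default `duplicated()` marks only the second and later occurrences, so the first copy of each row is kept by `drop_duplicates()`.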


# **Using sklearn**

## **Basic Functions**

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Display dataset information
print("dataset overview:")
print(df.head())
print("\nDataset information:")
df.info()  # info() prints its summary itself and returns None
print("\nDataset description:")
print(df.describe())

dataset overview:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target          

## **Splitting and scaling the data**

In [None]:
# split the data into features and target
x = df.drop('target', axis=1)
y = df['target']

# split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=42)

# standardize the features
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
# print confirmation
print("data preprocessing completed successfully")

data preprocessing completed successfully
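
The scaler is fitted on the training split and then reused on the test split, so both are standardized with the training mean and standard deviation. A minimal sketch of what that computes, using hypothetical arrays rather than the Iris features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features (two columns on different scales).
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# The same result computed by hand: (value - train mean) / train std.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(test_scaled, manual)
```

Calling `fit_transform` on the test split instead would standardize it with its own statistics, which leaks test information into preprocessing and puts the two splits on different scales.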


## **Linear Regression**



In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
y_pred_lin = lin_reg.predict(x_test)

print("\n Linear Regression Results:")
print("\n Linear Regression coefficients:",lin_reg.coef_)


 Linear Regression Results:

 Linear Regression coefficients: [-0.09543703 -0.02673551  0.44483159  0.41023025]
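
The cell above fits the model but does not score its predictions. Because linear regression outputs real numbers rather than class labels, regression metrics such as mean squared error and R² are the natural fit. The sketch below is an assumption about how one might evaluate it. It reloads Iris and skips scaling for brevity, which changes the coefficients but not the predictions of an ordinary least-squares fit.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_reg = LinearRegression().fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)

# Regression metrics: the predictions are real numbers, not class labels.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse)
print("R^2:", r2)

# Rounding to the nearest label is one crude way to read them as classes.
rounded = y_pred.round().clip(0, 2).astype(int)
print("rounded predictions:", rounded[:5])
```

Treating the labels 0, 1, 2 as numbers implies an ordering and equal spacing between classes, so a classifier is usually the better choice for this target.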


## **Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression()
log_reg.fit(x_train, y_train)
y_pred_log = log_reg.predict(x_test)

print("\n Logistic Regression accuracy:", accuracy_score(y_test, y_pred_log))


 Logistic Regression accuracy: 1.0
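
Accuracy is a single number. A confusion matrix and a per-class report show which classes are confused with each other. Both functions are imported at the top of this section but never called, so the sketch below shows one possible use. It rebuilds the split and the model so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

log_reg = LogisticRegression().fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1.
report = classification_report(y_test, y_pred)
print(report)
```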


## **Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt_clf = DecisionTreeClassifier()
dt_clf.fit(x_train, y_train)
y_pred_dt = dt_clf.predict(x_test)
print("\n Decision Tree accuracy:", accuracy_score(y_test, y_pred_dt))


 Decision Tree accuracy: 1.0


## **Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf_clf = RandomForestClassifier()
rf_clf.fit(x_train, y_train)
y_pred_rf = rf_clf.predict(x_test)
print("\n Random Forest accuracy:", accuracy_score(y_test, y_pred_rf))


 Random Forest accuracy: 1.0
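
An accuracy of 1.0 on a 30-sample test set is a coarse estimate, since each test sample is worth more than three percentage points. Cross-validation averages over several splits and gives a steadier picture. A minimal sketch, assuming 5 folds and fixed seeds chosen only for reproducibility:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validation: every sample is used for testing exactly once.
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

No scaler is used here because tree-based models split on feature thresholds and are not affected by the scale of the features.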


## **Support Vector Machine - SVM**

In [4]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm_model = SVC()
# x_train and x_test were already standardized with StandardScaler in the
# earlier preprocessing cell, so they can be used directly here.
svm_model.fit(x_train, y_train)
y_pred_svm = svm_model.predict(x_test)
print("\n SVM accuracy:", accuracy_score(y_test, y_pred_svm))

## **K-means Clustering**

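*   Choose k initial centroid points.
*   Assign each data point to its nearest centroid.
*   Recompute each centroid as the mean of the points assigned to it.
*   Repeat the assignment and update steps until the assignments stop changing.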
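
The K-means procedure can be sketched directly in NumPy. This is an illustrative implementation, not the code from the missing cell. The function name `kmeans`, its parameters, and the synthetic two-blob data are all assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two synthetic blobs (illustrative data only).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(points, k=2)
print(centroids)
```

In practice `sklearn.cluster.KMeans` does the same job with better initialization. The hand-written loop is only meant to make the two alternating steps visible.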