# **Machine Learning**

![ML.png](attachment:ML.png)

# **Data Preprocessing**

#### **1. Data Scaling**

**Min-Max Scaling (Normalization):**

- Scales the data to a specific range, often [0, 1].
- Useful when features have a known upper and lower bound and you want to preserve the relationship between data points.

![download.png](attachment:download.png)

In [None]:
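# from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler(feature_range=(0, 1))
# X = scaler.fit_transform(X)
# # Use a separate scaler for the target so the fit on X is not overwritten
# y_scaler = MinMaxScaler(feature_range=(0, 1))
# y = y_scaler.fit_transform(y.values.reshape(-1, 1))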
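
To make the arithmetic behind min-max scaling concrete, here is a minimal plain-Python sketch of the formula `(x - min) / (max - min)`, stretched to a target range. The function name `min_max_scale`, the sample `ages` list, and the handling of constant features are illustrative assumptions, not part of scikit-learn.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale a list of numbers linearly into feature_range."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant feature: this sketch maps every value to the lower bound.
        return [low for _ in values]
    return [low + (v - v_min) / span * (high - low) for v in values]


ages = [18, 25, 40, 60]  # hypothetical feature values
scaled_ages = min_max_scale(ages)
print(scaled_ages)
```

The smallest value lands on the lower bound and the largest on the upper bound, so a single extreme outlier compresses every other value into a narrow band.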

**Standardization (Z-score Scaling):**

- Standardizes data to have a mean (average) of 0 and a standard deviation of 1.
- It is suitable when the data has a Gaussian (normal) distribution or when you want to remove the mean and have unit variance.

![download.webp](attachment:download.webp)

In [None]:
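# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X = scaler.fit_transform(X)
# # Use a separate scaler for the target so the fit on X is not overwritten
# y_scaler = StandardScaler()
# y = y_scaler.fit_transform(y.values.reshape(-1, 1))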
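
As a rough illustration of what standardization computes, the sketch below applies `z = (x - mean) / std` using only the standard library. It uses the population standard deviation, which is also what `StandardScaler` uses; the function name `standardize` and the `heights` sample are assumptions made for this example.

```python
import statistics


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    if std == 0:
        # Constant feature: this sketch returns zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


heights = [150.0, 160.0, 170.0, 180.0]  # hypothetical feature values
z_scores = standardize(heights)
print(z_scores)
```

After the transformation the values have mean 0 and standard deviation 1, but their distribution keeps its original shape; standardization does not make data Gaussian.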

#### **2. Encoding**

**Label Encoding:**

- Label encoding is a technique for converting categorical data into numerical data.
- It assigns a unique integer (label) to each category or class in a categorical variable.
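- Often used for target labels and for categorical variables with ordinal relationships, where the order of categories matters. Note that scikit-learn's `LabelEncoder` assigns integers in sorted order of the class values, which may not match the intended ranking.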

![download.jpg](attachment:download.jpg)

In [None]:
# from sklearn.preprocessing import LabelEncoder
# encoder = LabelEncoder()
# df[column] = encoder.fit_transform(df[column])
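
The following sketch mimics the sorted-order integer assignment that label encoding performs, and contrasts it with an explicit mapping for an ordinal variable. The helper `fit_label_mapping`, the `sizes` data, and the `ordinal_mapping` dictionary are hypothetical names chosen for illustration.

```python
def fit_label_mapping(categories):
    """Map each distinct category to an integer, in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(categories)))}


sizes = ["medium", "small", "large", "small"]  # hypothetical ordinal column

# Sorted (alphabetical) assignment: large=0, medium=1, small=2
mapping = fit_label_mapping(sizes)
encoded = [mapping[s] for s in sizes]

# Explicit mapping that respects the real order small < medium < large
ordinal_mapping = {"small": 0, "medium": 1, "large": 2}
ordinal_encoded = [ordinal_mapping[s] for s in sizes]

print(encoded)
print(ordinal_encoded)
```

When the numeric order matters to the model, an explicit mapping like the second one is usually the safer choice, because alphabetical order rarely matches the semantic order.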

**One-Hot Encoding:**

- One-hot encoding is used to represent categorical variables as binary vectors.
- It creates a new binary feature (0 or 1) for each category in the original variable.
- Each observation gets a 1 in the corresponding category and 0s in all other categories.
- Useful for categorical variables without ordinal relationships, as it avoids introducing misleading ordinal information.

![download.png](attachment:download.png)

In [None]:
# import pandas as pd
# from sklearn.preprocessing import OneHotEncoder
# encoder = OneHotEncoder()
# X = encoder.fit_transform(df[column].values.reshape(-1, 1)).toarray()
# # One new column per category; reuse df's index so rows align in concat
# dfOneHot = pd.DataFrame(X, columns=[str(i) for i in range(X.shape[1])], index=df.index)
# # Append the binary columns and drop the original categorical column
# new_df = pd.concat([df, dfOneHot], axis=1).drop(columns=[column])
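
To show what the binary vectors look like, here is a small plain-Python sketch of one-hot encoding. The function `one_hot_encode` and the `countries` sample are assumptions for illustration; real pipelines would normally use `OneHotEncoder` or `pandas.get_dummies`.

```python
def one_hot_encode(categories):
    """Return (column_names, rows) with one binary column per category."""
    columns = sorted(set(categories))
    rows = [[1 if cat == col else 0 for col in columns] for cat in categories]
    return columns, rows


countries = ["France", "Spain", "Germany", "Spain"]  # hypothetical column
columns, rows = one_hot_encode(countries)

print(columns)
for country, row in zip(countries, rows):
    print(country, row)
```

Each row contains exactly one 1, so no category is treated as larger or smaller than another.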

#### **3. Splitting**

**Train-Test Split:**

- Split the dataset into two parts: a training set and a test set.
- The training set is used to train the model, and the test set is used to evaluate its performance.
- Common split ratios are 70-30, 80-20, or 90-10, depending on the dataset size.

![download.png](attachment:download.png)

In [None]:
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
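
The idea behind a shuffled split can be sketched without any library: shuffle the row indices with a fixed seed, then cut off the requested fraction for testing. The function `simple_train_test_split` is a hypothetical helper, and its rounding of the test size is a simplification that may differ from scikit-learn's.

```python
import random


def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train and test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed gives a repeatable split
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


samples = list(range(10))  # stand-in for 10 data rows
train_rows, test_rows = simple_train_test_split(samples, test_size=0.2)
print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one of the two sets, which is what keeps the test score an honest estimate of performance on unseen data.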

**K-Fold Cross-Validation:**

- Divide the dataset into "k" folds (subsets) of approximately equal size.
- Train and validate the model k times, each time using a different fold as the validation set and the remaining folds as the training set.
- Calculate performance metrics by averaging the results from all iterations.
- Common values for "k" are 5 and 10, but it can vary depending on the dataset size.

![download.png](attachment:download.png)

In [None]:
# from sklearn.model_selection import cross_val_score, KFold

# num_folds = 5
# kfold = KFold(n_splits=num_folds, shuffle=True, random_state=42)
# cv_results = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

# for i, accuracy in enumerate(cv_results, 1):
#     print(f"Fold {i}: {accuracy:.4f}")

# print(cv_results.mean())
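
The fold rotation itself can be sketched in a few lines. The generator below, `k_fold_indices`, is a hypothetical helper that produces contiguous, unshuffled folds; when the sample count is not divisible by k, the first folds get one extra sample.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, val
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for fold_number, (train, val) in enumerate(folds, 1):
    print(f"Fold {fold_number}: train={train} val={val}")
```

Each sample is used for validation exactly once and for training in the other k - 1 iterations.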

-----------------------------
-----------------------------

# **Logistic Regression**

![logistic regression.png](<attachment:logistic regression.png>)

- 

- 

#### **Linear equation**

![linear equation.jpg](<attachment:linear equation.jpg>)

- 

- 

![83.jpg](attachment:83.jpg)

- 

- 

#### https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [1]:
# from sklearn.linear_model import LogisticRegression
# model = LogisticRegression(penalty='l2', C=1.0)
# model.fit(X_train, y_train)

# print(model.score(X_train, y_train))

# y_pred = model.predict(X_test)

# from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# print(accuracy_score(y_test, y_pred))
# print(confusion_matrix(y_test, y_pred))
# print(classification_report(y_test, y_pred))
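
As a sketch of what a fitted logistic regression does at prediction time, the example below passes a linear combination `z = w·x + b` through the sigmoid function and thresholds the resulting probability at 0.5. The `weights`, `bias`, and feature values are made-up numbers for illustration, not coefficients from a trained model.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias  # linear equation
    return sigmoid(z)


weights = [0.8, -0.4]  # hypothetical coefficients
bias = -0.2            # hypothetical intercept
probability = predict_proba([2.0, 1.0], weights, bias)  # z is about 1.0
label = 1 if probability >= 0.5 else 0
print(probability, label)
```

A linear score of 0 corresponds to a probability of exactly 0.5, so the decision boundary is the set of points where `w·x + b = 0`.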

------------------------------
-----------------------------

# **Metrics**

#### **1. Regression**

- **Mean Absolute Error (MAE):**
    - MAE calculates the average absolute differences between predicted and actual values.
    - It is less sensitive to outliers compared to the mean squared error.

    ![download.png](attachment:download.png)

- **Mean Squared Error (MSE):**
    - MSE calculates the average of the squared differences between predicted and actual values.
    - It penalizes larger errors more heavily than MAE.

    ![download.webp](attachment:download.webp)

- **Root Mean Squared Error (RMSE):**
    - RMSE is the square root of MSE.
    - It provides a measure of the average magnitude of errors in the same unit as the target variable.

    ![rmse.png](attachment:rmse.png)

- **R-squared (R2) Score:**
    - R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
    - It is at most 1, where higher values indicate a better fit. It usually falls between 0 and 1, but it can be negative when the model fits worse than always predicting the mean.
    - An R2 score of 1 indicates a perfect fit, while 0 means the model performs no better than a horizontal line at the mean of the target.

    - R2 = 1 - Sum of Squared Residuals (SSR) / Total Sum of Squares (SST)
    - **SSR** is the sum of squared residuals: the squared differences between the actual values of the dependent variable and the values predicted by the regression model, summed over all observations.
    - **SST** is the total sum of squares: the squared differences between the actual values of the dependent variable and its mean, summed over all observations.

    ![r sqared.png](<attachment:r sqared.png>)
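
The four regression metrics can be computed by hand on a small example, which makes their relationship easier to see. In this sketch, R2 is computed as `1 - SSR / SST`; the function name `regression_metrics` and the `y_true` / `y_pred` values are illustrative assumptions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ssr = sum(e * e for e in errors)                # sum of squared residuals
    sst = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ssr / sst  # undefined if all y_true values are identical
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


y_true = [3.0, 5.0, 7.0, 9.0]   # hypothetical actual values
y_pred = [2.0, 5.0, 8.0, 11.0]  # hypothetical predictions
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

The single error of 2 contributes half of the total absolute error but two thirds of the total squared error, which is why MSE and RMSE react more strongly to large errors than MAE.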

#### **2. Classification**

![cm.png](attachment:cm.png)

**Accuracy:**
- Accuracy measures the proportion of correctly classified instances out of the total instances.
- It's suitable for balanced datasets but can be misleading for imbalanced datasets.

    ![2.-Accuracy-formula-machine-learning-algorithms.png](attachment:2.-Accuracy-formula-machine-learning-algorithms.png)

**Precision:**
- Precision calculates the proportion of true positive predictions among all positive predictions made by the model.
- It focuses on minimizing false positives.
- Precision = TP / (TP + FP)

    ![Precision-formula.png](attachment:Precision-formula.png)

**Recall (Sensitivity or True Positive Rate):**
- Recall calculates the proportion of true positive predictions among all actual positive instances in the dataset.
- It focuses on minimizing false negatives.
- Recall = TP / (TP + FN)

    ![Recall_1.png](attachment:Recall_1.png)

**Specificity (True Negative Rate):**
- Specificity calculates the proportion of true negative predictions among all actual negative instances in the dataset.
- Specificity = TN / (TN + FP)

    ![Specificity-equation.jpg](attachment:Specificity-equation.jpg)

**F1-Score:**
- The F1-Score is the harmonic mean of precision and recall.
- It provides a balance between precision and recall and is useful when there's an imbalance between classes.
- F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

    ![F1-Score.png](attachment:F1-Score.png)
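
All of the classification metrics above follow directly from the four confusion-matrix counts, as this short sketch shows. The function name `classification_metrics` and the counts (TP=40, FP=10, FN=20, TN=30) are invented for the example, and the sketch assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical confusion-matrix counts
scores = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Here precision (0.80) is noticeably higher than recall (about 0.67), and the F1-score (about 0.73) sits between them, closer to the lower value, as expected for a harmonic mean.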

----------------------------
----------------------------