
## What is meant by machine learning?

Machine learning is a branch of artificial intelligence (AI) and computer science that uses data and algorithms to imitate the way humans learn, so that a model's accuracy gradually improves as it is given more data.

Mean, Median, Mode:

Mean: The arithmetic mean is the sum of all data points divided by the number of data points.

In [None]:
import numpy as np

data = np.array([1, 2, 3, 4, 5])
mean = np.mean(data)
print(mean)


Output: 3.0

Median: The median is the middle value of a dataset when it is ordered from smallest to largest.

In [None]:
import numpy as np

data = np.array([1, 2, 3, 4, 5])
median = np.median(data)
print(median)


Output: 3.0

Mode: The mode is the value that appears most frequently in a dataset.

In [None]:
import statistics

data = [1, 2, 3, 3, 4, 5]
mode = statistics.mode(data)
print(mode)


Output: 3

Percentile: The percentile is the value below which a given percentage of observations in a dataset falls.

In [None]:
import numpy as np

data = np.array([1, 2, 3, 4, 5])
percentile_75 = np.percentile(data, 75)
print(percentile_75)


Output: 4.0

Data Distribution: Data distribution describes how the values in a dataset are spread out or distributed.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

data = np.random.normal(0, 1, 1000) # generate 1000 random numbers from a normal distribution
plt.hist(data, bins=50) # plot a histogram with 50 bins
plt.show()


Output: A histogram plot showing the distribution of random numbers generated from a normal distribution.

Normal Data Distribution: A normal data distribution is one in which the values cluster symmetrically around the mean and become less frequent the farther they lie from it, producing a bell-shaped histogram like the one plotted above.
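
The symmetry described above can be checked numerically. The following is a minimal sketch, assuming an arbitrary mean of 5, a standard deviation of 2, and a fixed seed chosen only for repeatability: for normally distributed samples the mean and median should nearly coincide, and roughly half of the values should fall below the mean.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable
data = rng.normal(5, 2, 100_000)  # mean 5 and standard deviation 2 are arbitrary choices

mean = np.mean(data)
median = np.median(data)
fraction_below_mean = np.mean(data < mean)  # share of samples smaller than the mean

print(mean, median)          # the two values should be close to each other
print(fraction_below_mean)   # should be close to 0.5 for a symmetric distribution
```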

Scatter Plot: 
A scatter plot is a type of plot used to display the relationship between two variables.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

x = np.random.normal(0, 1, 100) # generate 100 random numbers from a normal distribution
y = 2 * x + np.random.normal(0, 1, 100) # create a second variable that is linearly related to the first variable, with some added noise
plt.scatter(x, y) # plot each (x, y) pair as a point
plt.show()


Linear Regression: 

Linear regression is a technique used to model the relationship between a dependent variable and one or more independent variables.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.random.normal(0, 1, 100) # generate 100 random numbers from a normal distribution
y = 2 * x + np.random.normal(0, 1, 100) # create a second variable that is linearly related to the first variable, with some added noise
model = LinearRegression().fit(x.reshape(-1,1), y) # fit a linear regression model to the data
y_pred = model.predict(x.reshape(-1,1)) # use the model to make predictions
plt.scatter(x, y) # plot the original data
plt.plot(x, y_pred, color='red') # plot the predicted values
plt.show()


Output: A scatter plot of the original data and a line representing the predicted values from a linear regression model.
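
Besides plotting the fitted line, it can help to inspect the fitted parameters directly. The sketch below rebuilds the same kind of data with a fixed seed (an assumption made for repeatability) and reads the slope and intercept from the model's `coef_` and `intercept_` attributes. Because the data were generated with slope 2 and intercept 0, the estimates should land near those values, though not exactly on them because of the noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable
x = rng.normal(0, 1, 100)
y = 2 * x + rng.normal(0, 1, 100)  # true slope 2, true intercept 0, plus noise

model = LinearRegression().fit(x.reshape(-1, 1), y)
slope = model.coef_[0]        # estimated change in y per unit change in x
intercept = model.intercept_  # estimated value of y when x is 0
print(slope, intercept)
```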

Polynomial Regression: 


Polynomial regression is a technique used to model the relationship between a dependent variable and one or more independent variables using a polynomial function.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.random.normal(0, 1, 100) # generate 100 random numbers from a normal distribution
y = 2 * x + x**2 + np.random.normal(0, 1, 100) # create a second variable that is related to the first variable by a quadratic function, with some added noise
poly = PolynomialFeatures(degree=2) # create a polynomial feature transformer with degree 2
x_poly = poly.fit_transform(x.reshape(-1,1)) # transform the original data into a polynomial feature set
model = LinearRegression().fit(x_poly, y) # fit a linear regression model to the polynomial feature set
y_pred = model.predict(x_poly) # use the model to make predictions
plt.scatter(x, y) # plot the original data
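order = np.argsort(x) # sort by x so the fitted curve is drawn left to right instead of zigzagging
plt.plot(x[order], y_pred[order], color='red') # plot the predicted values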
plt.show()


Output: A scatter plot of the original data and a curve representing the predicted values from a polynomial regression model.

Multiple Regression: 


Multiple regression is a technique used to model the relationship between a dependent variable and two or more independent variables.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

x1 = np.random.normal(0, 1, 100) # generate 100 random numbers from a normal distribution for the first independent variable
x2 = np.random.normal(0, 1, 100) # generate 100 random numbers from a normal distribution for the second independent variable
y = 2 * x1 + 3 * x2 + np.random.normal(0, 1, 100) # create a dependent variable that is linearly related to the independent variables, with some added noise
X = np.column_stack((x1, x2)) # stack the independent variables horizontally to create a feature set
model = LinearRegression().fit(X, y) # fit a linear regression model to the data
y_pred = model.predict(X) # use the model to make predictions
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x1, x2, y) # plot the original data in 3D
ax.plot_trisurf(x1, x2, y_pred, color='red', alpha=0.5) # plot the predicted values as a surface in 3D
plt.show()


Scale:

Scale refers to the range of values a variable takes. The level of measurement (nominal, ordinal, interval, or ratio) describes what kind of values these are and which comparisons and arithmetic operations on them are meaningful.
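
As a rough illustration of why the range of a variable matters, the sketch below takes two made-up variables on very different ranges and rescales each to zero mean and unit standard deviation (standardization). The data and variable names are hypothetical, and standardization is only one of several possible ways to rescale.

```python
import numpy as np

# Hypothetical data: two variables measured on very different ranges
weight_kg = np.array([790.0, 1160.0, 929.0, 865.0, 1140.0])
volume_l = np.array([1.0, 1.2, 1.0, 0.9, 1.5])
X = np.column_stack((weight_kg, volume_l))

print(X.max(axis=0) - X.min(axis=0))  # range of each column before rescaling

# Standardize: subtract each column's mean, then divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```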

Train/Test: 

Train/test is a technique used to evaluate the performance of a machine learning model. The data is split into a training set and a testing set, and the model is trained on the training set and evaluated on the testing set to see how well it generalizes to new data.
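
The split described above might look like the following minimal sketch. The iris dataset, the 80/20 split ratio, the decision tree model, and the fixed seed are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# Hold out 20% of the rows for testing (an assumed split ratio)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)  # train on the training set only

train_acc = clf.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = clf.score(X_test, y_test)     # accuracy on held-out data
print(train_acc, test_acc)
```

A test accuracy far below the training accuracy is a common sign that the model has overfit the training set.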

Decision Tree:

 A decision tree is a hierarchical model that partitions the data into subsets based on the values of the independent variables, and it is often used for classification and regression tasks.

In [None]:
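import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = load_iris()
X = iris.data # four flower measurements per sample
y = iris.target # species label for each sample
clf = DecisionTreeClassifier()
clf.fit(X, y) # fit the tree to the full dataset
plot_tree(clf) # draw the fitted tree
plt.show()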


Output: A visualization of the decision tree model that was fitted to the iris dataset.

Confusion Matrix: 

A confusion matrix is a table that shows the true positives, true negatives, false positives, and false negatives of a classification model, and it is often used to evaluate the performance of the model.

In [None]:
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix

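X, y = make_classification(n_samples=100, n_features=3, n_informative=2, n_redundant=0, n_classes=2, random_state=42) # informative + redundant features must not exceed n_features
y_pred = [0 if x[0] < 0 else 1 for x in X] # naive rule-based predictions: class 0 if the first feature is negative, otherwise class 1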
cm = confusion_matrix(y, y_pred) # calculate the confusion matrix
print(cm)


Output: A 2x2 array of counts in which rows correspond to the true classes and columns to the predicted classes, so the diagonal holds the correct predictions.
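
To make the four counts concrete, here is a small worked sketch using hand-written labels (hypothetical data, not taken from the example above). For a binary problem, scikit-learn orders the flattened matrix as true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, chosen so the counts are easy to verify by hand
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # flatten the 2x2 matrix row by row
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions

print(cm)        # [[2 1]
                 #  [1 2]]
print(accuracy)  # 4 correct out of 6
```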

Logistic Regression: 


Logistic regression is a technique used to model the probability of a binary outcome (e.g., 0 or 1) based on one or more independent variables, and it is often used for classification tasks.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.random.normal(0, 1, 100) # generate 100 random numbers from a normal distribution
y = np.where(x > 0, 1, 0) # create a binary outcome variable based on the values of the independent variable
model = LogisticRegression().fit(x.reshape(-1,1), y) # fit a logistic regression model to the data
y_pred = model.predict_proba(x.reshape(-1,1))[:,1] # use the model to predict the probability of the positive outcome
plt.scatter(x, y) # plot the original data
order = np.argsort(x) # sort by x so the probability curve is drawn left to right
plt.plot(x[order], y_pred[order], color='red') # plot the predicted probabilities
plt.show()


Grid Search:


 Grid search is a technique used to find the optimal hyperparameters for a machine learning model by exhaustively searching through a specified set of hyperparameters and evaluating the performance of the model for each combination.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
X = iris.data
y = iris.target
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]} # define the hyperparameter grid to search over
clf = SVC()
grid_search = GridSearchCV(clf, parameters)
grid_search.fit(X, y) # perform the grid search
print(grid_search.best_params_) # print the best hyperparameters that were found


Output: The best hyperparameters that were found for the support vector machine (SVM) model when applied to the iris dataset.

Categorical Data: 


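Categorical data takes its values from a fixed set of labels rather than from numeric measurements, such as gender (male/female) or education level (high school/college/graduate). Most models expect numeric input, so these labels are usually encoded as numbers first.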

In [None]:
import pandas as pd

df = pd.DataFrame({'gender': ['male', 'male', 'female', 'male', 'female'], 'education': ['college', 'high school', 'graduate', 'graduate', 'high school']}) # create a dataframe with two categorical variables
df = pd.get_dummies(df) # use one-hot encoding to convert the categorical variables to binary variables
print(df)


Output: A dataframe with two categorical variables that have been converted to binary variables using one-hot encoding.

K-means:


 K-means is a clustering technique used to group similar data points together based on their distances from a specified number of centroids, and it is often used for exploratory data analysis and pattern recognition.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.normal(0, 1, (100, 2)) # generate 100 random 2D points from a normal distribution
model = KMeans(n_clusters=2).fit(X) # fit a K-means clustering model to the data with 2 clusters
plt.scatter(X[:,0], X[:,1], c=model.labels_) # plot the data with different colors for each cluster
plt.show()


Output: A scatter plot of the random data points, with different colors indicating the two clusters that were identified by the K-means clustering model.

Bootstrap Aggregation: 


Bootstrap aggregation (or bagging) is a technique used to improve the stability and accuracy of a machine learning model by training multiple models on bootstrap samples of the data (random samples drawn with replacement) and combining their predictions, for example by majority vote.

In [None]:
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

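X, y = make_classification(n_samples=100, n_features=3, n_informative=2, n_redundant=0, n_classes=2, random_state=42) # informative + redundant features must not exceed n_features
clf = DecisionTreeClassifier()
model = BaggingClassifier(clf, n_estimators=10) # base estimator passed positionally, since its keyword name differs across scikit-learn versions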
model.fit(X, y) # fit a bagged decision tree model to the data
print(model.predict(X[:10])) # print the predicted classes for the first 10 data points


Output: The predicted classes for the first 10 data points using the bagged decision tree model.

Cross Validation:


 Cross validation is a technique used to estimate the performance of a machine learning model by repeatedly splitting the data into training and testing sets, and averaging the results to get a more accurate estimate of the model's performance.
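
A minimal sketch of the idea, assuming the iris dataset, a decision tree classifier, and 5 folds (all chosen here for illustration): `cross_val_score` trains and evaluates the model once per fold and returns one score per fold, which can then be averaged.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=42)

# Each of the 5 folds (an assumed choice) serves once as the test set
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
mean_score = scores.mean()

print(scores)      # one accuracy value per fold
print(mean_score)  # the averaged estimate of model performance
```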