Data-Science-Assignments

This repository contains my notes/study material related to the topics and Python code for all the machine learning algorithms.

Index

Assignment 1 - Basic Statistics (1)
Assignment 2 - Basic Statistics (2)
Assignment 3 - Hypothesis Testing
Assignment 4 - Simple Linear Regression
Assignment 5 - Multiple Linear Regression
Assignment 6 - Logistic Regression
Assignment 7 - Clustering
Assignment 8 - PCA
Assignment 9 - Association Rules
Assignment 10 - Recommendation Engine
Assignment 11 - Text Mining
Assignment 12 - Naive Bayes
Assignment 13 - KNN
Assignment 14 - Decision Tree
Assignment 15 - Random Forest
Assignment 16 - Neural Network
Assignment 17 - SVM
Assignment 18 - Forecasting

Training and Testing Data

  1. It's good practice to first randomly shuffle the data and then split it into two parts: 80% of the data for training the model and the remaining 20% for testing it.
  2. The reason we don't use the same training set for testing is that our model has already seen those samples; reusing them for predictions would give us a false impression of our model's accuracy.
  3. Here we are going to use the sklearn.model_selection.train_test_split method, as sketched below.
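
A minimal sketch of the split, using toy arrays invented purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with one feature each (invented for illustration)
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

# shuffle=True (the default) randomizes the order before splitting;
# test_size=0.2 holds out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)
```

Passing a fixed random_state makes the shuffle reproducible across runs.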

Logistic Regression

  1. Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable.
  2. Binary classification: when the outcome has only two categories (yes/no, 0/1, buy/not buy), e.g. predicting whether a customer will buy an insurance policy.
  3. Multiclass classification: when the outcome has more than two categories, e.g. which party a person is going to vote for (BJP, Congress, AAP). A minimal sketch of the binary case follows.
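
A minimal sketch of the insurance example using scikit-learn's LogisticRegression; the age/purchase data below is invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: customer age vs. whether they bought insurance (1/0)
ages = np.array([[22], [25], [28], [34], [46], [47], [52], [56], [60], [62]])
bought = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(ages, bought)

print(model.predict([[40]]))        # predicted class for a 40-year-old
print(model.predict_proba([[40]]))  # probability of each class
```

The same estimator handles the multiclass case automatically when y contains more than two labels.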

Decision Trees

  1. Decision trees are among the most powerful algorithms in the supervised learning category.
  2. A decision tree can be used to solve both regression and classification problems.
  3. The goal of a decision tree is to create a model that predicts a class or value by learning simple decision rules from the training data.
  4. The two main entities of a tree are decision nodes, where the data is split, and leaves, where we get the outcome.
  5. The decision tree algorithm grows the tree by choosing splits with high information gain: the lower the entropy after a split, the higher the information gain (see the sketch after this list).
  6. Entropy: it is basically a measure of randomness in your sample. If there is no randomness in your sample, the entropy is low.
  7. Types of decision trees:
     a. Classification decision trees − the decision variable is categorical.
     b. Regression decision trees − the decision variable is continuous.
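
A small sketch of the entropy and information-gain calculation the tree uses to pick splits, written in plain NumPy rather than any particular library:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array: zero when all labels agree."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 1, 1, 1, 1])                    # mixed node, entropy ~0.92
left, right = np.array([0, 0]), np.array([1, 1, 1, 1])   # a perfectly pure split
print(information_gain(parent, left, right))             # ~0.92, the maximum possible here
```

The split with the highest information gain (lowest remaining entropy) is the one the algorithm chooses at each decision node.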

Support Vector Machine (SVM)

  1. The SVM algorithm is preferred by many as it provides accurate results with less computational power.
  2. SVMs are mostly used for classification tasks but can be used for regression as well.
  3. SVM is suited for extreme cases, where the difference between classes is very small (e.g. a cat that is groomed like a dog).
  4. The SVM looks at the extreme points in the dataset and draws a boundary (a line in 2D, a hyperplane in more than 2D) between those extreme points to separate the classes, which results in the best possible segregation of classes.
  5. Support vectors are the data points that lie closest to the opposing class. SVM considers only these support vectors when defining the classification boundary and ignores the other training examples.
  6. e.g. suppose we have a dataset of dogs and cats, containing a dog that looks like a cat and a cat that is groomed like a dog. The SVM algorithm will use these two extreme examples as support vectors and draw a boundary to classify the dog and cat classes. Since this boundary is based on the extreme examples (support vectors), it takes care of the other training examples as well.
  7. SVM uses multiple such support vectors to classify the dataset and maximize the margin between the two classes, as in the sketch below.
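
A minimal sketch with scikit-learn's SVC on invented 2-D points; after fitting, the support_vectors_ attribute shows which training points actually define the boundary:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D features for two classes
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)   # only the boundary-defining points
print(clf.predict([[4, 4]]))  # classify a new point
```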

SVM parameters

  1. Gamma: with a high value of gamma, the decision boundary depends only on the points close to it, whereas with a low value of gamma the SVM also considers far-away points when deciding the decision boundary.
  2. Regularization parameter (C): a large C can result in overfitting, which leads to lower bias and high variance. A small C can result in underfitting, which leads to higher bias and low variance. The sketch below shows how both parameters are passed.
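
A sketch of how the two parameters are set on scikit-learn's SVC; the specific values are arbitrary, chosen only to contrast the regimes described above:

```python
from sklearn.svm import SVC

# High gamma: the boundary is shaped only by nearby points (can overfit).
# Low gamma: far-away points also influence the boundary (smoother).
tight = SVC(kernel="rbf", gamma=10.0, C=1.0)
smooth = SVC(kernel="rbf", gamma=0.01, C=1.0)

# Large C: few margin violations tolerated -> lower bias, higher variance.
# Small C: wide, soft margin -> higher bias, lower variance.
strict = SVC(kernel="rbf", gamma=1.0, C=100.0)
lenient = SVC(kernel="rbf", gamma=1.0, C=0.01)
```

In practice a grid search (e.g. sklearn.model_selection.GridSearchCV) over C and gamma is the usual way to pick good values.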


Random Forest

  1. Random forest is a supervised learning algorithm used for classification as well as regression. However, it is mostly used for classification problems.
  2. The random forest algorithm divides the dataset into multiple random samples, builds a decision tree on each sample to get a prediction, and then chooses the best answer by voting.
  3. We can understand the working of the Random Forest algorithm with the help of the following steps (sketched in code below):
     1. Step 1 − Start with the selection of random samples from the given dataset.
     2. Step 2 − Construct a decision tree for every sample and get a prediction from every tree.
     3. Step 3 − Perform voting over the predicted results.
     4. Step 4 − Select the most-voted prediction as the final result.
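
A minimal sketch of those steps using scikit-learn's RandomForestClassifier; the Iris dataset is just a convenient stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# n_estimators trees are each grown on a bootstrap sample (steps 1-2);
# predict() aggregates the per-tree votes (steps 3-4)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # accuracy on the held-out 20%
```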

Pros

  1. It overcomes the problem of overfitting by averaging or combining the results of different decision trees.
  2. Random forests work well for a larger range of data than a single decision tree does.
  3. A random forest has less variance than a single decision tree.
  4. Random forests are very flexible and possess very high accuracy.
  5. Scaling of the data is not required by the random forest algorithm; it maintains good accuracy even when given unscaled data.
  6. The Random Forest algorithm maintains good accuracy even when a large proportion of the data is missing.

Cons

  1. Complexity is the main disadvantage of Random forest algorithms.
  2. Constructing a random forest is much harder and more time-consuming than constructing a decision tree.
  3. More computational resources are required to implement the Random Forest algorithm.
  4. It is less intuitive when we have a large collection of decision trees.
  5. The prediction process using random forests is time-consuming in comparison with other algorithms.

K Means Clustering

  1. K Means is an unsupervised learning algorithm. It is used to find clusters in unlabelled data.
  2. K = the number of clusters you want to find.

How the K Means Algorithm Works

  1. The first step is to randomly initialize K points and call them centroids.
  2. The number of centroids should be equal to the number of clusters you want to find.
  3. In the 'assignment' step, the K Means algorithm goes through each of the data points and assigns each one to the cluster whose centroid it is closest to.
  4. During 'assignment', if there is any centroid with no data points associated with it, that centroid can be removed.
  5. In the 'move' step, the K Means algorithm finds the mean of the data points assigned to each centroid and moves that centroid to the mean location.
  6. The algorithm keeps repeating the 'assignment' and 'move' steps until convergence, as in the sketch after this list.
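
A bare-bones NumPy sketch of the assignment/move loop described above; in practice sklearn.cluster.KMeans does all of this for you:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move step: each centroid moves to the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```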

Choosing the Number of Clusters (K)

  1. Mostly the K value is chosen manually.
  2. The elbow method: run K Means for a range of K values, plot the within-cluster sum of squares against K, and pick the K at the 'elbow' where the curve stops dropping sharply (see the sketch below).
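
A sketch of the elbow method using KMeans's inertia_ attribute (the within-cluster sum of squares); the synthetic blobs are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs in 2-D
X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in (0, 5, 10)])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ drops sharply until the "elbow" at k=3, then flattens
    print(k, round(km.inertia_, 1))
```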

Naive Bayes Algorithm

Basic Probability

  1. The probability of getting a head (or a tail) when you flip a fair coin is 0.5, i.e. 50%.
  2. Similarly, the probability of drawing a queen from a deck of cards is 4/52, i.e. about 7.7%.

Conditional Probability

  1. Unlike basic probability, in conditional probability we know that event B has already occurred and we are trying to predict the probability of event A. e.g. what is the probability of drawing a queen given that the card is a diamond? Here the card being a diamond is event B.
  2. So the conditional probability of getting the queen of diamonds is represented as P(Queen | Diamond) = 1/13, i.e. about 7.7%.
  3. The more general notation is P(A | B): the probability of event A knowing that event B has already occurred.
  4. Bayes' conditional probability equation is: P(A | B) = P(B | A) * P(A) / P(B), checked numerically below.
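
A quick numeric check of the queen-of-diamonds example against Bayes' equation:

```python
# P(Queen | Diamond) = P(Diamond | Queen) * P(Queen) / P(Diamond)
p_diamond_given_queen = 1 / 4   # one of the four queens is a diamond
p_queen = 4 / 52
p_diamond = 13 / 52

p_queen_given_diamond = p_diamond_given_queen * p_queen / p_diamond
print(p_queen_given_diamond)  # 1/13 ~= 0.0769, i.e. about 7.7%
```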

Naive Bayes

  1. Using Bayes' conditional probability equation, we can find the probability of certain events based on the probability of some known events.
  2. It's called 'naive' because it assumes the known events (features) are independent of each other, which makes the calculations much simpler.

Naive Bayes Classifiers

Bernoulli Naive Bayes

  1. It assumes that all our features are binary, meaning they take only two values, 0 and 1.
  2. e.g. 1 can represent spam mails while 0 can represent ham (non-spam) mails.

Multinomial Naive Bayes

It is used when we have discrete data, e.g. movie ratings from 1 to 5, since each rating value occurs with a certain frequency.

Gaussian Naive Bayes

  1. Because of its assumption of a normal distribution (bell curve), Gaussian Naive Bayes is used when all the features are continuous.
  2. e.g. the Iris flower dataset features (sepal width, sepal length, petal width, petal length) are continuous. We can't represent these features in terms of discrete counts of occurrences, which means the data is continuous. A minimal sketch follows.
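
A minimal sketch fitting GaussianNB to the Iris dataset mentioned above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# GaussianNB models each continuous feature as a per-class normal distribution
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))  # accuracy on the held-out 20%
```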

Where it's used

  1. Email spam detection
  2. Handwritten digit recognition
  3. Weather prediction
  4. Face detection
  5. News article categorization
