**Machine Learning Studies**

Machine learning is a field of computer science that aims to teach computers how to learn and act without being explicitly programmed. More specifically, machine learning is an approach to data analysis that involves building and adapting models, which allow programs to "learn" through experience. Machine learning involves the construction of algorithms that adapt their models to improve their ability to make predictions.


**Supervised and Unsupervised Learning**

In supervised learning, the algorithm learns from a labeled dataset; the labels act as an answer key that the algorithm can use to evaluate its accuracy on the training data. In unsupervised learning, in contrast, the data is unlabeled, and the algorithm tries to make sense of it by extracting features and patterns on its own.

![q2.png](attachment:q2.png)
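
As a minimal sketch of the difference, the snippet below uses made-up one-dimensional data. The supervised part uses the labels to compute one mean per class (a nearest-centroid classifier), while the unsupervised part sees only the numbers and groups them with a simple two-cluster k-means. The data, the function names, and the choice of these two algorithms are illustrative assumptions, not something the notes prescribe.

```python
from statistics import mean

# Hypothetical toy data: one numeric feature per example, plus a label.
labeled = [(1.0, "small"), (1.2, "small"), (0.8, "small"),
           (5.0, "large"), (5.5, "large"), (4.7, "large")]

def fit_centroids(examples):
    """Supervised: use the labels to compute one mean value per class."""
    by_label = {}
    for x, label in examples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def predict(centroids, x):
    """Assign x to the class whose mean is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two groups (1-D k-means, k=2)."""
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not group_high:  # all values identical: nothing to split
            break
        low, high = mean(group_low), mean(group_high)
    return group_low, group_high

centroids = fit_centroids(labeled)
print(predict(centroids, 1.1))   # small
print(predict(centroids, 4.9))   # large

unlabeled = [x for x, _ in labeled]  # same numbers, labels thrown away
print(two_means(unlabeled))      # ([1.0, 1.2, 0.8], [5.0, 5.5, 4.7])
```

Note that the unsupervised routine recovers two groups but has no way of naming them "small" or "large"; that meaning comes only from labels.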

**Test and validation set**

A validation dataset is a sample of data held back from training your model that is used to give an estimate of model skill while tuning the model's hyperparameters.

The validation dataset is different from the test dataset that is also held back from the training of the model, but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.
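
The two roles can be sketched with a tiny k-nearest-neighbours classifier on invented one-dimensional data. The hyperparameter `k` is chosen using the validation set only, and the test set is touched once at the end. The classifier, the data, and the helper names (`knn_predict`, `accuracy`, `select_k`) are assumptions made for illustration.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of (x, label) pairs in data that k-NN classifies correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)

def select_k(train, validation, candidates):
    """Pick the candidate k with the best validation accuracy (first wins ties)."""
    return max(candidates, key=lambda k: accuracy(train, validation, k))

# Hypothetical data; (2.4, "b") plays the role of a noisy training point.
train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.4, "b"),
         (6.0, "b"), (6.5, "b"), (7.0, "b")]
validation = [(1.2, "a"), (2.5, "a"), (6.2, "b"), (6.8, "b")]
test = [(1.8, "a"), (2.3, "a"), (5.8, "b"), (7.2, "b")]

best_k = select_k(train, validation, [1, 3, 5])   # tuning uses validation only
print(best_k)                                     # 3
print(accuracy(train, test, best_k))              # 1.0, reported once at the end
```

Because `best_k` was picked to do well on the validation set, the validation score is an optimistic estimate; the test score is the one that was not used for any decision.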

**Data Preprocessing**

Data preprocessing in Machine Learning is a crucial step that improves the quality of the data so that meaningful insights can be extracted from it. It refers to the technique of preparing (cleaning and organizing) the raw data to make it suitable for building and training Machine Learning models. In simple words, data preprocessing is a data mining technique that transforms raw data into an understandable and readable format.

**Pre-processing Steps**
* Data Quality Assessment
* Feature Aggregation
* Feature Sampling
* Dimensionality Reduction
* Feature Encoding

**Data Quality Assessment**
* **1. Missing values :**

Eliminate rows with missing data :
This approach fails if many objects have missing values. If a feature has mostly missing values, then that feature itself can also be eliminated.

Estimate missing values :
If only a reasonable percentage of values are missing, then we can run simple interpolation methods to fill in those values. However, the most common method of dealing with missing values is to fill them in with the mean, median or mode value of the respective feature.

* **2. Inconsistent values :**
Assess the data by checking what the data type of each feature should be and whether it is the same for all the data objects.

* **3. Duplicate values :**
In most cases, duplicates are removed so that the repeated data object does not get extra weight, and therefore introduce bias, when running machine learning algorithms.
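
The three checks above can be sketched with the standard library alone. Missing values are represented as `None` here, and the helper names and sample data are hypothetical; real projects typically use a dataframe library instead.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the known values."""
    known = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [fill if v is None else v for v in values]

def find_type_mismatches(values, expected_type):
    """Return positions of non-missing values that are not of the expected type."""
    return [i for i, v in enumerate(values)
            if v is not None and not isinstance(v, expected_type)]

def drop_duplicates(rows):
    """Remove repeated rows while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

ages = [22, None, 30, 22, None, 41]
print(impute(ages, "mean"))     # [22, 28.75, 30, 22, 28.75, 41]
print(impute(ages, "median"))   # [22, 26.0, 30, 22, 26.0, 41]
print(impute(ages, "mode"))     # [22, 22, 30, 22, 22, 41]

print(find_type_mismatches([22, "30", 41], int))   # [1]

rows = [("ann", 22), ("bob", 30), ("ann", 22)]
print(drop_duplicates(rows))    # [('ann', 22), ('bob', 30)]
```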

**Feature Aggregation**

Feature aggregation combines individual values into aggregated ones (for example, daily records into monthly totals) in order to put the data in a better perspective.

* This reduces memory consumption and processing time.

* Aggregations provide a high-level view of the data, because the behaviour of groups or aggregates is more stable than that of individual data objects.
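
A small sketch of the idea, assuming invented daily sales records that are rolled up into one total per month and store. The record layout and the function name `aggregate_monthly` are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical daily sales records: (ISO date, store, amount).
daily_sales = [
    ("2024-01-03", "north", 120.0),
    ("2024-01-17", "north", 80.0),
    ("2024-01-09", "south", 200.0),
    ("2024-02-02", "north", 150.0),
    ("2024-02-20", "south", 50.0),
    ("2024-02-25", "south", 70.0),
]

def aggregate_monthly(records):
    """Sum amounts per (month, store), collapsing many daily rows into few."""
    totals = defaultdict(float)
    for date, store, amount in records:
        month = date[:7]  # keep only the "YYYY-MM" part of the date
        totals[(month, store)] += amount
    return dict(totals)

monthly = aggregate_monthly(daily_sales)
print(len(daily_sales), "rows ->", len(monthly), "rows")   # 6 rows -> 4 rows
print(monthly[("2024-02", "south")])                       # 120.0
```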

**Feature Sampling**

Using a sampling algorithm can help us reduce the size of the dataset to a point where we can use a better, but more expensive, machine learning algorithm.

The key principle here is that the sampling should be done in such a manner that the sample generated should have approximately the same properties as the original dataset, meaning that the sample is representative. This involves choosing the correct sample size and sampling strategy.

Simple Random Sampling dictates that there is an equal probability of selecting any particular entity. It has two main variations as well :

* Sampling without Replacement : As each item is selected, it is removed from the set of all the objects that form the total dataset.
* Sampling with Replacement : Items are not removed from the total dataset after getting selected. This means they can get selected more than once.
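
Both variations map directly onto Python's standard `random` module: `sample` draws without replacement and `choices` draws with replacement. The wrapper names and the `seed` parameter below are assumptions added to make the sketch reproducible.

```python
import random

def sample_without_replacement(items, n, seed=None):
    """Each item can be selected at most once (n must not exceed len(items))."""
    rng = random.Random(seed)
    return rng.sample(items, n)

def sample_with_replacement(items, n, seed=None):
    """Items stay in the pool after selection, so repeats are possible."""
    rng = random.Random(seed)
    return rng.choices(items, k=n)

items = list(range(10))
print(sample_without_replacement(items, 5, seed=1))  # five distinct items
print(sample_with_replacement(items, 15, seed=1))    # 15 draws from 10 items, so repeats occur
```

Sampling with replacement can return more draws than there are items, which is why the second call can ask for 15 values from a pool of 10.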

Although Simple Random Sampling provides two great sampling techniques, it can fail to output a representative sample when the dataset includes object types which vary drastically in ratio. This can cause problems when the sample needs to have a proper representation of all object types, for example, when we have an *imbalanced* dataset.

In these cases, we can use another sampling technique called *Stratified Sampling*, which begins with predefined groups of objects. There are different versions of Stratified Sampling too; the simplest version draws an equal number of objects from every group, even though the groups are of different sizes.
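
Here is a minimal sketch of that simplest version (an equal number of objects per group), using an invented imbalanced dataset of 95 "ok" and 5 "fraud" records. The function name, the `key` and `per_group` parameters, and the data are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, per_group, seed=None):
    """Draw per_group records (without replacement) from every group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        # A group smaller than per_group contributes all of its members.
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Hypothetical imbalanced dataset: (id, label).
records = [(i, "ok") for i in range(95)] + [(i, "fraud") for i in range(95, 100)]

sample = stratified_sample(records, key=lambda r: r[1], per_group=5, seed=0)
print(Counter(label for _, label in sample))   # 5 "ok" and 5 "fraud"
```

A simple random sample of ten records from the same data could easily contain no "fraud" records at all. Equal allocation guarantees every group appears, at the cost of over-representing small groups relative to their share of the original dataset.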

**Dimensionality Reduction**

*Dimension* refers to the number of features (axes) the dataset has, which can be so high that the data cannot be visualized with pen and paper. The greater the number of dimensions, the greater the complexity of the dataset.

Dimensionality reduction maps the dataset to a lower-dimensional space, possibly one that can be visualized, say 2D. The basic objective of the techniques used for this purpose is to reduce the dimensionality of a dataset by creating new features which are a combination of the old features.

A few major benefits of dimensionality reduction are :
* Data analysis algorithms often work better if the dimensionality of the dataset is lower, mainly because irrelevant features and noise may have been eliminated.
* The models which are built on top of lower dimensional data are more understandable and explainable.
* The data may also get easier to visualize. Features can always be taken in pairs or triplets for visualization purposes, which makes more sense if the feature set is not that big.
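
One common technique of this kind is principal component analysis (PCA). The notes above do not name a specific method, so PCA is an assumption here. The sketch below handles only the smallest possible case, reducing 2-D points to 1-D, where the direction of greatest variance has a closed form; the function name and data are invented.

```python
import math
from statistics import mean

def pca_2d_to_1d(points):
    """Project 2-D points onto their direction of greatest variance."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 covariance matrix (the common 1/n factor is omitted
    # because it does not change the direction).
    sxx = sum(x * x for x, _ in centered)
    syy = sum(y * y for _, y in centered)
    sxy = sum(x * y for x, y in centered)
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # The new feature is a linear combination of the two old features.
    projected = [x * ux + y * uy for x, y in centered]
    return projected, (ux, uy)

# Hypothetical points lying exactly on the line y = 2x.
points = [(0, 0), (1, 2), (2, 4), (3, 6)]
projected, direction = pca_2d_to_1d(points)
print(direction)   # approximately (0.447, 0.894), i.e. (1, 2) / sqrt(5)
print(projected)   # one number per point instead of two
```

Because these points lie on a line, the single new feature keeps all of the information; with real data some variance is lost, and deciding how much loss is acceptable is part of the modelling work.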

**Feature Encoding**

Feature encoding is basically performing transformations on the data such that it can be easily accepted as input for machine learning algorithms while still retaining its original meaning.

There are some general norms or rules which are followed when performing feature encoding. For Categorical variables :

* Nominal : Any one-to-one mapping can be done which retains the meaning. For instance, a permutation of values like in One-Hot Encoding.

* Ordinal : An order-preserving change of values. The notion of small, medium and large can be represented equally well by a new function `new_value = f(old_value)`, as long as `f` preserves the order. For example, {0, 1, 2} or maybe {1, 2, 3}.

![onehotencoding.png](attachment:onehotencoding.png)
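
The nominal and ordinal cases above can be sketched in a few lines. The function names and sample values are hypothetical, and the column order chosen for the one-hot encoding (alphabetical) is an arbitrary assumption, since nominal categories have no inherent order.

```python
def one_hot(values):
    """Nominal: one 0/1 column per category, with no order implied."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

def ordinal(values, order):
    """Ordinal: map each category to its rank in the given order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

categories, rows = one_hot(["red", "green", "red", "blue"])
print(categories)   # ['blue', 'green', 'red']
print(rows)         # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

sizes = ["small", "large", "medium"]
print(ordinal(sizes, order=["small", "medium", "large"]))   # [0, 2, 1]
```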

For Numeric variables:
* Interval : A simple mathematical transformation such as `new_value = a*old_value + b`, with a and b being constants. For example, the Fahrenheit and Celsius scales, which differ in both their zero point and the size of a unit, can be converted into each other in this manner.

* Ratio : These variables can be rescaled to any unit of measure while still maintaining the meaning and ratio of their values. Simple mathematical transformations work in this case as well, like `new_value = a*old_value`. For example, length can be measured in meters or feet, and money can be expressed in different currencies.
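
As a short worked example of the two formulas, Celsius to Fahrenheit uses a = 9/5 and b = 32, and meters to feet uses a = 1/0.3048 with no offset. The function names are made up for this sketch.

```python
def celsius_to_fahrenheit(c):
    """Interval scale: new_value = a*old_value + b with a = 9/5 and b = 32."""
    return c * 9 / 5 + 32

def meters_to_feet(m):
    """Ratio scale: new_value = a*old_value, using 1 ft = 0.3048 m."""
    return m / 0.3048

print(celsius_to_fahrenheit(10), celsius_to_fahrenheit(20))   # 50.0 68.0
print(meters_to_feet(1), meters_to_feet(2))
```

The offset b is what separates the two cases: 20 °C is not "twice" 10 °C, since the same temperatures are 68 °F and 50 °F, whereas 2 m is twice 1 m in any unit of length.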

**Train / Validation / Test Split**
Before deciding which algorithm to use, it is always advised to split the dataset into 2 or sometimes 3 parts. Machine Learning algorithms, or any algorithms for that matter, have to be first trained on the data distribution available and then validated and tested, before they can be deployed to deal with real-world data.

**Training data :** This is the part on which your machine learning algorithms are actually trained to build a model. The model tries to learn the dataset and its various characteristics and intricacies, which also raises the issue of overfitting vs. underfitting.

**Validation data :** This is the part of the dataset which is used to validate our various model fits. In simpler words, we use validation data to choose and improve our model hyperparameters. The model does not learn the validation set but uses it to get to a better state of hyperparameters.

**Test data :** This part of the dataset is used to test our model hypothesis. It is left untouched and unseen until the model and hyperparameters are decided, and only after that the model is applied on the test data to get an accurate measure of how it would perform when deployed on real-world data.

**Split Ratio :** Data is split as per a split ratio which is highly dependent on the type of model we are building and the dataset itself. If our dataset and model are such that a lot of training is required, then we use a larger chunk of the data just for training purposes (usually the case). For instance, training on textual data, image data, or video data usually involves thousands of features!

If the model has a lot of hyperparameters that can be tuned, then keeping a higher percentage of data for the validation set is advisable. Models with fewer hyperparameters are easy to tune and update, and so we can keep a smaller validation set.

Like many other things in Machine Learning, the split ratio is highly dependent on the problem we are trying to solve and must be decided after taking into account all the various details about the model and the dataset in hand.

![train-validation-test.jpg](attachment:train-validation-test.jpg)
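
A three-way split like the one described above can be sketched as a shuffle followed by two cuts. The 70/15/15 default ratios are only a placeholder assumption, since the right ratio depends on the problem, and the function name is hypothetical.

```python
import random

def train_val_test_split(data, val_ratio=0.15, test_ratio=0.15, seed=None):
    """Shuffle a copy of data and cut it into train, validation, and test parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    n_val = round(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]   # everything left over is used for training
    return train, val, test

data = list(range(100))
train, val, test = train_val_test_split(data, seed=42)
print(len(train), len(val), len(test))   # 70 15 15
```

Shuffling before cutting matters when the data is stored in some order (for example, sorted by class or by date). For time-ordered data a chronological split is usually preferred over a shuffle.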

**Continuous and Discrete Variables**


*Discrete data* is information that can only take certain values. These values don't have to be whole numbers (a child might have a shoe size of 3.5 or a company may make a profit of £3456.25 for example) but they are fixed values; a child cannot have a shoe size of 3.72!

The number of each type of treatment a salon needs to schedule for the week, the number of children attending a nursery each day or the profit a business makes each month are all examples of discrete data. This type of data is often represented using tally charts, bar charts or pie charts.

*Continuous data* is data that can take any value. Height, weight, temperature and length are all examples of continuous data. Some continuous data will change over time; the weight of a baby in its first year or the temperature in a room throughout the day. This data is best shown on a line graph as this type of graph can show how the data changes over a given period of time. Other continuous data, such as the heights of a group of children on one particular day, is often grouped into categories to make it easier to interpret.
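
Grouping continuous data into categories, as mentioned for children's heights, can be sketched as simple binning. The 10 cm interval width, the function name, and the sample heights are assumptions for this example.

```python
from collections import Counter

def bin_heights(heights_cm, width=10):
    """Group continuous heights into intervals of the given width (in cm)."""
    counts = Counter()
    for h in heights_cm:
        lower = int(h // width) * width  # lower edge of the interval containing h
        counts[(lower, lower + width)] += 1
    return dict(sorted(counts.items()))

heights = [112.4, 118.9, 121.3, 125.0, 127.8, 133.1]
print(bin_heights(heights))
# {(110, 120): 2, (120, 130): 3, (130, 140): 1}
```

Each interval includes its lower edge and excludes its upper edge, so a height of exactly 120.0 falls in the 120 to 130 group.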


## Sources

https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d?gi=78b7aa1060bb

https://machinelearningmastery.com/difference-test-validation-datasets/

https://towardsdatascience.com/data-preprocessing-concepts-fa946d11c825

https://www.open.edu/openlearn/ocw/mod/oucontent/view.php?id=85587&section=1