<b>What is Data Scrubbing?</b>

Data scrubbing is the process of refining your dataset to make it more workable.
It is achieved by modifying, or sometimes removing, data items that are incomplete, incorrectly formatted, irrelevant, or duplicated.

<b>What is Feature Selection?</b>

Feature selection means choosing wisely which features to keep.
Let's say we have 4 features in a dataset. If we want to plot the data in order to understand it, a 4D plot is really difficult to read, whereas if we select two highly positively correlated features out of the 4, we can draw a 2D plot that is much easier to understand and interpret.
Keeping features that do not correlate strongly with the outcome value can distort the model and decrease its accuracy.
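
One way to act on this is to rank features by how strongly they correlate with the outcome. The snippet below is a minimal sketch, assuming pandas is available; the toy dataset, its column names, and the choice of keeping exactly two features are illustrative assumptions, not part of any real dataset.

```python
import pandas as pd

# Hypothetical toy dataset: four features plus an outcome column named "price".
df = pd.DataFrame({
    "area":      [50, 60, 80, 100, 120, 150],
    "rooms":     [1, 2, 2, 3, 4, 5],
    "door_code": [7, 3, 9, 1, 8, 2],
    "age":       [30, 5, 18, 25, 9, 12],
    "price":     [100, 125, 160, 205, 240, 300],
})

# Absolute Pearson correlation of every feature with the outcome.
correlations = df.corr()["price"].drop("price").abs()

# Keep the two features that correlate most strongly with the outcome.
selected = correlations.sort_values(ascending=False).head(2).index.tolist()
print(selected)
```

Correlation only captures linear relationships, so treat a ranking like this as a first filter rather than a final answer.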

Below are some methods you can use to make your dataset more precise, more understandable, and easier to work with.

 1. Row Compression

<table>
<tr><th>Animal</th><th>Meat Eater</th><th>Legs</th><th>Tail</th><th>Race Time</th></tr>
<tr><td>Tiger</td><td>Yes</td><td>4</td><td>Yes</td><td>7:02 minutes</td></tr>
<tr><td>Lion</td><td>Yes</td><td>4</td><td>Yes</td><td>7:06 minutes</td></tr>
<tr><td>Tortoise</td><td>No</td><td>4</td><td>Yes</td><td>45:01 minutes</td></tr>
</table>

For example, given the data above, we can merge Tiger and Lion into a single group, Carnivore, and take the average of their Race Times. The dataset then looks like this.

<table>
<tr><th>Animal</th><th>Meat Eater</th><th>Legs</th><th>Tail</th><th>Race Time</th></tr>
<tr><td>Carnivore</td><td>Yes</td><td>4</td><td>Yes</td><td>7:04 minutes</td></tr>
<tr><td>Tortoise</td><td>No</td><td>4</td><td>Yes</td><td>45:01 minutes</td></tr>
</table>

Here we have reduced the number of rows, which gives a cleaner and more compact dataset.<br>
<b>Hence, we should always examine our dataset carefully to see whether any rows can be merged and any unnecessary columns eliminated, so that the data is easier to interpret.</b>
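
The merge described above can be sketched with a pandas `groupby`. This is only an illustration under stated assumptions: the helper functions `to_seconds` and `to_mmss`, the extra `Group` and `Race Seconds` columns, and the rule "every meat eater becomes Carnivore" are all hypothetical choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Tiger", "Lion", "Tortoise"],
    "Meat Eater": ["Yes", "Yes", "No"],
    "Legs": [4, 4, 4],
    "Tail": ["Yes", "Yes", "Yes"],
    "Race Time": ["7:02", "7:06", "45:01"],
})

def to_seconds(mmss):
    """Convert a 'minutes:seconds' string into a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def to_mmss(total_seconds):
    """Convert a number of seconds back into a 'minutes:seconds' string."""
    minutes, seconds = divmod(int(round(total_seconds)), 60)
    return f"{minutes}:{seconds:02d}"

df["Race Seconds"] = df["Race Time"].apply(to_seconds)

# Keep the animal name for non meat eaters, label every meat eater "Carnivore".
df["Group"] = df["Animal"].where(df["Meat Eater"] == "No", "Carnivore")

# Merge rows that share a group and average their race times.
compressed = df.groupby(
    ["Group", "Meat Eater", "Legs", "Tail"], as_index=False
)["Race Seconds"].mean()
compressed["Race Time"] = compressed["Race Seconds"].apply(to_mmss)
print(compressed)
```

Averaging is done on seconds because the original times are strings and cannot be averaged directly.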

For example, suppose the data above had an extra column, "date". Ask yourself: is this an important feature, or should I drop it? <br>
There are datasets where the date plays a major role, so the answer depends on the dataset. This is another reason to always examine your dataset carefully.

 2. One-hot Encoding

One-hot encoding is a process for converting text-based data into numeric data. <br>
Aside from set text-based values such as True/False, most algorithms are not compatible with nonnumeric data.
One-hot encoding transforms values into binary form, represented as “1” or “0” (“True” or “False”). A “0”, representing False, means that the value does not belong to a particular feature, whereas a “1” (True, or “hot”) confirms that the value does belong to that feature.

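For example, if a dataset has a gender column with the values Male, Female and Other, the simplest conversion is to set Male as "0", Female as "1" and Other as "2". Strictly speaking, this is integer (label) encoding rather than one-hot encoding, and it suggests an ordering between the categories that does not really exist. Yes/No data can be transformed into "0" and "1" in the same way.

<b>We can implement this mapping as follows:</b>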

<pre>
Ty = []  # empty list for storing the encoded values
# df is the variable that holds the dataset read from the csv file;
# gender is the column we want to encode.
for i in df['gender']:
    if i == 'Male':
        Ty.append(0)   # male is replaced with the number 0
    elif i == 'Female':
        Ty.append(1)   # female is replaced with the number 1
    elif i == 'Other':
        Ty.append(2)   # other is replaced with the number 2
df['TY'] = Ty  # add a new column that stores the encoded values
</pre>

We can also create a new column for each categorical value. In the example above, we would have a Male column containing 1 where the gender is Male, a Female column containing 1 where the gender is Female, and an Other column containing 1 where the gender is Other. This is one-hot encoding in the strict sense.

We do this by calling the get_dummies method of pandas (imported here as pd). The code is given below:<br>
<b>dummies=pd.get_dummies(df.gender)<br>
dummies<br></b>
Here df is the variable that holds the dataset read from the csv file, and gender is the column. The output is three columns, Female, Male and Other, with "1" marking the presence of that attribute and "0" its absence (in pandas 2.0 and later the values are shown as True/False unless dtype=int is passed).

Next we merge the three columns with our dataset:<br>
<b>merged=pd.concat([df,dummies],axis='columns')<br>
merged<br></b>
Here we concatenate (add) the three columns to our original dataset; axis specifies how they are added, i.e. column-wise.

We can then drop the gender column, and also any one of the three new columns.<br>
Why drop one of the new columns? The answer is simple: if both Male and Female are "0", the gender must be Other. That column carries no extra information, so we can drop any one of the three to clean our dataset.<br>
The code is given below:<br>
<b>final=merged.drop(['gender','Other'],axis='columns')<br>
final<br></b>
Here we have dropped the gender and Other columns; axis specifies how they are dropped, i.e. column-wise.
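
Put together, the three steps can be run end to end as below. This is a minimal sketch on a made-up four-row dataset; the `name` column and its values are assumptions added only so that there is something to merge the dummy columns back into, and `dtype=int` is passed so the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical dataset standing in for the csv file read into df.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "gender": ["Male", "Female", "Other", "Female"],
})

# Step 1: one column per category, 1 where the row has that category.
dummies = pd.get_dummies(df["gender"], dtype=int)

# Step 2: attach the new columns to the original dataset, column-wise.
merged = pd.concat([df, dummies], axis="columns")

# Step 3: drop the original text column and one redundant dummy column.
final = merged.drop(["gender", "Other"], axis="columns")
print(final)
```

The row with gender Other ends up with 0 in both the Male and Female columns, which is why the Other column can be dropped without losing information.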

3. Binning

Binning is another method, used to convert numeric values into categories. <br>
But why would we convert numeric values into categories? In most cases numeric values are preferred, as they are compatible with a broader selection of algorithms. Numeric values are less ideal when they record variations that are irrelevant to the goals of your analysis. Let’s take house price evaluation as an example. The exact measurements of a badminton court might not matter greatly when evaluating house prices; the relevant information is whether the house has a badminton court at all. <br>
The same logic probably applies to the garage and the swimming pool, where the existence or non-existence of the feature is generally more influential than its specific measurements. The solution here is to replace the numeric measurements of the badminton court with a True/False or Yes/No feature, or with a categorical value such as “small,” “medium,” and “large.” Another alternative would be to apply one-hot encoding with “0” for homes that do not have a badminton court and “1” for homes that do.
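
Both options for the badminton court can be sketched with pandas. The column name `court_area_m2`, the sample areas, and the bin edges (60 and 100 square metres) are arbitrary assumptions chosen for illustration; sensible thresholds depend on your data.

```python
import pandas as pd

# Hypothetical court areas in square metres; 0 means the home has no court.
homes = pd.DataFrame({"court_area_m2": [0, 40, 82, 150, 0]})

# Option 1: a binary feature, 1 if the home has a court and 0 otherwise.
homes["has_court"] = (homes["court_area_m2"] > 0).astype(int)

# Option 2: bin the area into categories. The intervals are (0, 60],
# (60, 100] and (100, inf), so an area of 0 falls outside every bin.
homes["court_size"] = pd.cut(
    homes["court_area_m2"],
    bins=[0, 60, 100, float("inf")],
    labels=["small", "medium", "large"],
)

# Homes without a court were left unbinned; give them their own label.
homes["court_size"] = (
    homes["court_size"].cat.add_categories("none").fillna("none")
)
print(homes)
```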

4. Missing Data

Dealing with missing data is never a desired situation. Imagine unpacking a deck of cards with five percent of the cards missing. Can you play with that deck? No. Missing values in your dataset can be equally frustrating and interfere with your analysis and the model’s predictions. There are, however, strategies to minimize the negative impact of missing data. <br>
The <b>first approach</b> is to approximate missing values using the mode. The mode is the single most common value of a variable in the dataset. This works best with categorical and binary variable types, such as one-to-five-star rating systems and positive/negative drug tests respectively.<br>
The <b>second approach</b> is to approximate missing values using the median, which is the value located in the middle of the sorted data (or the average of the two middle values). This works best with continuous variables, which can take an infinite number of possible values, such as house prices. <br>
As a <b>last resort</b>, rows with missing values can be removed altogether. The obvious downside to this approach is having less data to analyze and potentially less comprehensive insight.
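
The three strategies might look like this in pandas. It is a minimal sketch on an invented dataset; the `rating` and `price` columns and their values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing rating and one missing price.
df = pd.DataFrame({
    "rating": [5, 4, np.nan, 4, 3],
    "price": [200.0, np.nan, 350.0, 275.0, 1000.0],
})

# First approach: fill a categorical-style column with its mode.
# mode() can return several values on a tie; [0] takes the first one.
mode_filled = df["rating"].fillna(df["rating"].mode()[0])

# Second approach: fill a continuous column with its median.
median_filled = df["price"].fillna(df["price"].median())

# Last resort: drop every row that contains a missing value.
dropped = df.dropna()

print(mode_filled.tolist())
print(median_filled.tolist())
print(len(dropped))
```

The median is used for the price column because a single extreme value, such as the 1000.0 here, would pull the mean upwards but barely moves the median.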