# FEATURE ENGINEERING:

###  Feature Engineering :
1. Introduction
2. Why is it used
3. Why do we begin with this step

###   Null Values :
1. What are null values
2. How to identify them
3. How to handle null values
4. Implementation

###  Outliers :
1. What are outliers
2. How can we detect them
3. How to handle the outliers
4. Implementation

###  Preprocessing :

1. What is preprocessing
2. What are the steps for preprocessing
3. What are the functions for preprocessing
4. Implementation

### Transformation :

1. What is transformation
2. What are the types of transformation
3. Implementation



###  Scaling :

1. What is scaling
2. Why do we scale
3. What are the types of scaling
4. Implementation




##  Introduction to feature engineering :

- Feature Engineering is the process of extracting and organizing the important features from raw data in such a way that it fits the purpose of the machine learning model.

- It can be thought of as the art of selecting the important features and transforming them into refined and meaningful features that suit the needs of the model.

- Feature Engineering encapsulates various data engineering techniques such as selecting relevant features, handling missing data, encoding the data, and normalizing it.

- It is one of the most crucial tasks and plays a major role in determining the outcome of a model.

- In order to ensure that the chosen algorithm can perform to its optimum capability, it is important to engineer the features of the input data effectively.



## Why is Feature Engineering so important?

- It helps to know what takes the most time and effort in a Machine Learning workflow.
- Getting the data ready for the modelling stage is commonly said to take up to 80% of a data scientist's time.

- This is where Feature Engineering comes into play.
- After the data is cleaned and processed it is then ready to be fed into the machine learning models to train and generate outputs.

- Feature engineering is focused on using the variables you already have to create additional features that are (hopefully) better at representing the underlying structure of your data.

- Feature engineering is a creative process that relies heavily on domain knowledge and the thorough exploration of your data.
- But before we go any further, we need to step back and answer an important question: what is a feature?

- A feature is not just any variable in a dataset.
- A feature is a variable that is important for predicting your specific target and addressing your specific question(s).

- For example, a variable that is commonly found in many datasets is some form of unique identifier.
- This identifier may represent a unique individual, building, transaction, etc.

- Unique identifiers are very useful because they allow you to filter out duplicate entries or merge data tables, but unique IDs are not useful predictors.
- We wouldn’t include such a variable in our model because the model would quickly overfit to the training data without gaining any useful information for predicting future observations.
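
As a minimal sketch of the point about unique identifiers, the example below uses the ID to remove a duplicate entry and then leaves it out of the model inputs. The table and its column names (`transaction_id`, `amount`, `is_fraud`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical transactions table; all names and values are illustrative.
df = pd.DataFrame({
    "transaction_id": [101, 102, 102, 103],
    "amount": [25.0, 40.0, 40.0, 15.5],
    "is_fraud": [0, 1, 1, 0],
})

# Use the identifier for what it is good at: filtering out duplicate entries.
df = df.drop_duplicates(subset="transaction_id")

# Then keep it out of the feature matrix so the model cannot memorize it.
X = df.drop(columns=["transaction_id", "is_fraud"])  # predictors
y = df["is_fraud"]                                   # target

print(X.columns.tolist())
```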

- Creating additional features that better emphasize the trends in your data has the potential to boost model performance.
- After all, the quality of any model is limited by the quality of the data you feed into it.

- Just because the information is technically already in your dataset does not mean a machine learning algorithm will be able to pick up on it.

- Important information can get lost amidst the noise and competing signals in a large feature space.
- Thus, in some ways, feature engineering is like trying to tell a model what aspects of the data are worth focusing on.

- This is where your domain knowledge and creativity as a data scientist can really shine.
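
To make "creating additional features from the variables you already have" concrete, here is a small sketch. The trip data and every column name in it are assumptions made up for this example; which derived features actually help depends on the problem and on domain knowledge.

```python
import pandas as pd

# Hypothetical trip records; names and values are illustrative only.
df = pd.DataFrame({
    "pickup_time": pd.to_datetime(["2024-01-05 08:15", "2024-01-06 23:40"]),
    "distance_km": [4.0, 12.0],
    "duration_min": [16.0, 24.0],
})

# The raw timestamp already contains the hour and the day of the week,
# but exposing them as separate columns makes them easier for a model to use.
df["pickup_hour"] = df["pickup_time"].dt.hour
df["is_weekend"] = df["pickup_time"].dt.dayofweek >= 5  # Monday=0 ... Sunday=6

# Combine two existing variables into one that reflects domain knowledge.
df["avg_speed_kmh"] = df["distance_km"] / df["duration_min"] * 60

print(df[["pickup_hour", "is_weekend", "avg_speed_kmh"]])
```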




#### So far we have established that Feature Engineering is an extremely important part of a Machine Learning Pipeline, but why is it needed in the first place?

-  In most cases, Data Scientists deal with data extracted from massive open data sources such as the internet, surveys, or reviews.

- This data is crude and is known as raw data.
- It may contain missing values, unstructured data, incorrect inputs, and outliers.

- If we directly use this raw, unprocessed data to train our models, we will end up with a model that performs poorly.

- Thus Feature Engineering plays an extremely pivotal role in determining the performance of any machine learning model.

- Analyzing the dataset's features is very important.
- Whenever you get your hands on a dataset, you must first spend some time analyzing it.

- This will help you get an understanding of the type of features and data you are dealing with.
- Analyzing the dataset will also help you create a mind map of the feature engineering techniques that you will need to process your data.
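
A first pass over a new dataset could look something like the sketch below. The DataFrame is a made-up stand-in for whatever data you are working with; the pandas calls are standard, but the choice of checks is only one reasonable starting point.

```python
import pandas as pd

# Hypothetical dataset; in practice this would come from a file or database.
df = pd.DataFrame({
    "age": [34, 45, None, 29],
    "city": ["Pune", "Delhi", "Delhi", None],
    "income": [52000.0, 61000.0, 58000.0, 1_000_000.0],
})

print(df.dtypes)       # which columns are numeric and which are text
print(df.describe())   # summary statistics; extreme min/max values hint at outliers

# Count missing values per column to see which features need attention.
missing_per_column = df.isna().sum()
print(missing_per_column)

# For categorical columns, look at the distinct values and their frequencies.
print(df["city"].value_counts(dropna=False))
```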

- Once these steps are complete, the dataset is feature engineered and ready to be fed into a Machine Learning model.
- It can then be used to train the model to make the desired predictions.
- At that point the missing values have been handled, the categorical variables have been encoded, and the features have been brought to a uniform scale.
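
The three steps named above (handling missing values, encoding categorical variables, scaling) can be sketched with pandas alone. The data, the column names, and the specific choices (median fill, one-hot encoding, min-max scaling) are assumptions for illustration, not the only valid options.

```python
import pandas as pd

# Hypothetical raw data with one numeric and one categorical column.
df = pd.DataFrame({
    "age": [20.0, None, 40.0, 60.0],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# 1. Missing values: fill the numeric gap with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# 2. Categorical variables: one-hot encode the city column.
encoded = pd.get_dummies(df, columns=["city"])

# 3. Scaling: min-max scale age into the range [0, 1].
age_min, age_max = encoded["age"].min(), encoded["age"].max()
encoded["age"] = (encoded["age"] - age_min) / (age_max - age_min)

print(encoded)
```

In a real project the median, minimum, and maximum would normally be computed on the training split only and then reused on the test data.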

- With the data in this state, we can train the model and evaluate its results with far fewer data problems getting in the way.

- Once you have feature engineered the variables in your dataset, your models are in a much better position to perform well, because the algorithms are no longer held back by problems in the input data.

- Feature engineering, like so many things in data science, is an iterative process. Investigating, experimenting, and doubling back to make adjustments are crucial.
- The insights you stand to gain into the structure of your data and the potential improvements to model performance are usually well worth the effort.
- Plus, if you’re relatively new to all this, feature engineering is a great way to practice working with and manipulating DataFrames!

- That is all about Feature Engineering!


# Null Values :

- A null value is a placeholder that indicates the absence of a value.
- Null values exist for all data types. The null value of a given data type is different from all non-null values of the same data type.
- By default, any column can contain null values.
- You can use either the NOT NULL or the CHECK parameter in a column definition to disallow null values in the column.

## How You Specify a Null Value

- In SQL, you use the keyword NULL to indicate a null value.
- For example, consider an INSERT statement that adds a new row to the DEPARTMENT table.
- The department number and name and the division code are known, but the department head has not been appointed yet.
- A null value is used as a placeholder in the DEPT_HEAD_ID column.
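
The INSERT described here can be tried out with Python's built-in `sqlite3` module. This is only a sketch: apart from `DEPARTMENT` and `DEPT_HEAD_ID`, the column names and the inserted values are hypothetical, and other database systems may use slightly different column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

conn.execute(
    """
    CREATE TABLE DEPARTMENT (
        DEPT_NO      INTEGER NOT NULL,  -- hypothetical column name
        DEPT_NAME    TEXT    NOT NULL,  -- hypothetical column name
        DIV_CODE     TEXT    NOT NULL,  -- hypothetical column name
        DEPT_HEAD_ID INTEGER            -- nullable: the head may not be known yet
    )
    """
)

# The department head has not been appointed, so NULL is the placeholder.
conn.execute(
    "INSERT INTO DEPARTMENT (DEPT_NO, DEPT_NAME, DIV_CODE, DEPT_HEAD_ID) "
    "VALUES (?, ?, ?, NULL)",
    (405, "Customer Support", "E"),
)

# SQL NULL comes back to Python as None.
row = conn.execute("SELECT DEPT_HEAD_ID FROM DEPARTMENT").fetchone()
print(row)

# NULL is tested with IS NULL, not with the = operator.
no_head = conn.execute(
    "SELECT COUNT(*) FROM DEPARTMENT WHERE DEPT_HEAD_ID IS NULL"
).fetchone()[0]
print(no_head)
```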

- Missing values are a common issue in machine learning.
- They occur when a particular variable lacks data points, which leaves the information incomplete and can harm the accuracy and dependability of your models.
- It is essential to address missing values properly to get robust and unbiased results in your machine learning projects.





## What is a Missing Value?

- Missing values are data points that are absent for a specific variable in a dataset.
- They can be represented in various ways, such as blank cells, null values, or special symbols like “NA” or “unknown.”
- These missing data points pose a significant challenge in data analysis and can lead to inaccurate or biased results.
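
Because missing values can be written in several ways, it helps to tell pandas which markers to treat as missing when loading the data. In the sketch below the CSV content is made up; `read_csv` already treats blank cells and "NA" as missing by default, while a custom marker such as "unknown" has to be listed in `na_values`.

```python
import io
import pandas as pd

# Hypothetical CSV with three different ways of marking a missing value.
raw = io.StringIO(
    "name,age,city\n"
    "Asha,31,Pune\n"
    "Ravi,,unknown\n"      # blank cell and a custom "unknown" marker
    "Meera,NA,Delhi\n"     # "NA" is recognized by default
)

# na_values adds "unknown" to the markers pandas treats as missing.
df = pd.read_csv(raw, na_values=["unknown"])

missing = df.isna().sum()
print(missing)
```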

### Methods for Identifying Missing Data

- Locating and understanding patterns of missingness in the dataset is an important step in addressing its impact on analysis.
- Pandas provides several useful functions for detecting, removing, and replacing null values in a DataFrame.



## Functions that can be used to identify the null values :

##### .isnull()

- Identifies missing values in a Series or DataFrame.

##### .notnull()

- Checks for non-missing values in a pandas Series or DataFrame. It returns a boolean Series or DataFrame, where True indicates non-missing values and False indicates missing values.

##### .info()

- Displays information about the DataFrame, including data types, memory usage, and the number of non-null values in each column (which reveals how many values are missing).

##### .isna()

- An alias of isnull() and the opposite of notnull(): it returns True for missing values and False for non-missing values.
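
The detection functions described above can be compared side by side on a small example. The DataFrame here is hypothetical; the method names are the standard pandas ones.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "age": [25, np.nan, 31],
    "city": ["Pune", "Delhi", None],
})

mask = df.isnull()       # True where a value is missing
present = df.notnull()   # True where a value is present (inverse of mask)

# isna() is the same check as isnull(); summing it counts missing values per column.
null_counts = df.isna().sum()
print(null_counts)

# info() prints the non-null count for every column.
df.info()

# Select only the rows that contain at least one missing value.
rows_with_nulls = df[df.isna().any(axis=1)]
print(rows_with_nulls)
```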

Related functions for handling missing values and cleaning data:

- dropna(): Drops rows or columns containing missing values based on custom criteria.
- fillna(): Fills missing values with specific values, means, medians, or other calculated values.
- replace(): Replaces specific values with other values, facilitating data correction and standardization.
- drop_duplicates(): Removes duplicate rows based on specified columns.
- unique(): Finds the unique values in a Series.
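
A short sketch of these handling functions in use follows. The data and the choice of fill value (the median) are assumptions for illustration; whether to drop or fill depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing age, inconsistent spelling, and a duplicate row.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 31.0],
    "city": ["Pune", "delhi", "Delhi", "Delhi"],
})

# replace(): standardize an inconsistent spelling.
cleaned = df.copy()
cleaned["city"] = cleaned["city"].replace("delhi", "Delhi")

# drop_duplicates(): the last two rows are identical, so one of them is removed.
deduped = cleaned.drop_duplicates()

# fillna(): fill the missing age with the median of the remaining ages.
filled = deduped.fillna({"age": deduped["age"].median()})

# dropna(): alternatively, drop every row that still has a missing value.
dropped = deduped.dropna()

# unique(): list the distinct values of a single column (a Series).
cities = cleaned["city"].unique()

print(filled)
print(dropped)
print(cities)
```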


