<a href="https://colab.research.google.com/github/swopnimghimire-123123/Machine-Learning-Journey/blob/main/45_feature_construction_and_feature_splitting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Feature Engineering: Feature Construction and Feature Scaling

**Feature Construction:**

Feature construction, also known as feature creation or feature generation, is the process of creating new features from existing ones in your dataset. This is a crucial step in machine learning as it can significantly improve the performance of your models. By combining, transforming, or aggregating existing features, you can create new features that capture more relevant information or relationships within the data.

Examples of feature construction techniques include:

*   **Polynomial Features:** Creating new features by raising existing features to a power (e.g., squaring a feature).
*   **Interaction Terms:** Creating new features by multiplying two or more existing features.
*   **Aggregate Features:** Creating new features by aggregating information from related data points (e.g., calculating the average of a feature for a group).
*   **Date and Time Features:** Extracting information from date and time features, such as day of the week, month, year, or time of day.
*   **Domain-Specific Features:** Creating features based on domain knowledge and understanding of the problem.

The goal of feature construction is to engineer features that are more informative and predictive for your machine learning task.
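
As a rough illustration of the techniques listed above, the sketch below builds one feature of each kind with pandas. The DataFrame `demo` and all of its column names are hypothetical and are not part of the dataset used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data; names and values are for illustration only.
demo = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [10.0, 20.0, 30.0, 40.0],
    "when": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-02-10", "2024-03-15"]),
})

# Polynomial feature: square an existing column.
demo["x1_squared"] = demo["x1"] ** 2

# Interaction term: product of two columns.
demo["x1_times_x2"] = demo["x1"] * demo["x2"]

# Aggregate feature: per-group mean broadcast back to every row.
demo["x1_group_mean"] = demo.groupby("group")["x1"].transform("mean")

# Date features: month and day of week (Monday = 0).
demo["month"] = demo["when"].dt.month
demo["day_of_week"] = demo["when"].dt.dayofweek

print(demo)
```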

**Feature Scaling:**

Feature scaling is a data preprocessing technique that brings the independent variables (features) of a dataset onto a comparable range. Many machine learning algorithms are sensitive to the scale of the features, although tree-based models such as decision trees and random forests are largely unaffected. If features have vastly different ranges, a scale-sensitive algorithm can be biased towards features with larger values, leading to suboptimal performance.

Feature scaling helps to:

*   **Improve Algorithm Performance:** Many algorithms, such as models trained with gradient descent (e.g., linear regression, logistic regression, neural networks) and algorithms that rely on distances between points (e.g., K-Nearest Neighbors, Support Vector Machines), perform better when features are on a similar scale.
*   **Prevent Dominance of Features:** It prevents features with larger values from dominating the learning process.
*   **Speed up Convergence:** For iterative optimization algorithms, scaling can help the algorithm converge faster to the optimal solution.

Common feature scaling techniques include:

*   **Min-Max Scaling (Normalization):** Scales features to a fixed range, usually between 0 and 1. The formula is: $X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$
*   **Standardization (Z-score normalization):** Scales features to have zero mean and unit variance. The formula is: $X_{scaled} = \frac{X - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation.
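
The two formulas above can be checked by hand on a small array. This is a minimal NumPy sketch; the array `x` is made-up data chosen so the results are easy to verify.

```python
import numpy as np

# Made-up values for illustration.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling: (X - X_min) / (X_max - X_min)
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (X - mu) / sigma, using the population standard deviation (ddof=0)
x_std = (x - x.mean()) / x.std()

print(x_minmax)  # values lie between 0 and 1
print(x_std)     # mean 0, standard deviation 1
```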

The choice of scaling technique depends on the specific algorithm and the distribution of your data. To avoid data leakage, fit the scaler on the training data only, then apply that same fitted transformation to both the training and testing datasets.
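
A minimal sketch of applying one scaling transformation to both splits: the scaler's statistics are learned from the training rows and then reused for the test rows. The data here is synthetic, and names such as `X_demo` and `y_demo` are assumptions made for illustration; they are not part of this notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y_demo = (X_demo[:, 0] > 50.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics on the test rows; do not refit here.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_test_scaled.mean(axis=0))   # generally not exactly 0
```

When scaling is combined with cross-validation, wrapping the scaler and the model in a scikit-learn `Pipeline` keeps this fit-on-train-only behaviour inside every fold.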

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

import seaborn as sns

In [3]:
df = pd.read_csv("/content/45_train.csv")[["Age","Pclass","SibSp","Parch","Survived"]]

In [4]:
df.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Survived
0,22.0,3,1,0,0
1,38.0,1,1,0,1
2,26.0,3,0,0,1
3,35.0,1,1,0,1
4,35.0,3,0,0,0


In [5]:
df.dropna(inplace=True)

In [7]:
df.shape

(714, 5)

In [8]:
df.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Survived
0,22.0,3,1,0,0
1,38.0,1,1,0,1
2,26.0,3,0,0,1
3,35.0,1,1,0,1
4,35.0,3,0,0,0


In [10]:
X = df.iloc[:,0:4].copy()  # copy so that adding columns to X later does not act on a slice of df
y = df.iloc[:,-1]

In [11]:
X.head()

Unnamed: 0,Age,Pclass,SibSp,Parch
0,22.0,3,1,0
1,38.0,1,1,0
2,26.0,3,0,0
3,35.0,1,1,0
4,35.0,3,0,0


In [12]:
np.mean(cross_val_score(LogisticRegression(),X,y,scoring='accuracy',cv=20))

np.float64(0.6933333333333332)

### Applying Feature Construction

In [13]:
X['Family_size'] = X['SibSp'] + X['Parch'] + 1

In [14]:
X.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Family_size
0,22.0,3,1,0,2
1,38.0,1,1,0,2
2,26.0,3,0,0,1
3,35.0,1,1,0,2
4,35.0,3,0,0,1


In [15]:
def myfunc(num):
    # Map a family size to a category: 0 = alone, 1 = small family, 2 = large family
    if num == 1:
        # alone
        return 0
    elif 1 < num <= 4:
        # small family
        return 1
    else:
        # large family
        return 2

In [16]:
myfunc(4)

1

In [17]:
X['Family_type'] = X['Family_size'].apply(myfunc)
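
The same binning can also be expressed without a Python-level function. The sketch below uses `pd.cut` on a small made-up series (`family_size` is hypothetical data, not the notebook's `X`); the bin edges are chosen to mirror the thresholds in `myfunc`, assuming family sizes are positive integers.

```python
import pandas as pd

# Made-up family sizes for illustration.
family_size = pd.Series([1, 2, 4, 5, 8], name="Family_size")

# Right-closed bins: (0, 1] -> 0 (alone), (1, 4] -> 1 (small), (4, inf) -> 2 (large)
family_type = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=[0, 1, 2],
).astype(int)

print(family_type.tolist())
```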

In [18]:
X.head()

Unnamed: 0,Age,Pclass,SibSp,Parch,Family_size,Family_type
0,22.0,3,1,0,2,1
1,38.0,1,1,0,2,1
2,26.0,3,0,0,1,0
3,35.0,1,1,0,2,1
4,35.0,3,0,0,1,0


In [19]:
X.drop(columns=['SibSp','Parch','Family_size'],inplace=True)

In [20]:
X.head()

Unnamed: 0,Age,Pclass,Family_type
0,22.0,3,1
1,38.0,1,1
2,26.0,3,0
3,35.0,1,1
4,35.0,3,0


In [21]:
np.mean(cross_val_score(LogisticRegression(),X,y,scoring='accuracy',cv=20))

np.float64(0.7003174603174602)

### Feature Splitting

In [23]:
df = pd.read_csv("45_train.csv")

In [24]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [25]:
df["Name"]

Unnamed: 0,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,"Allen, Mr. William Henry"
...,...
886,"Montvila, Rev. Juozas"
887,"Graham, Miss. Margaret Edith"
888,"Johnston, Miss. Catherine Helen ""Carrie"""
889,"Behr, Mr. Karl Howell"


In [26]:
df['Title'] = df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

In [28]:
df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

Unnamed: 0,0
0,Mr
1,Mrs
2,Miss
3,Mrs
4,Mr
...,...
886,Rev
887,Miss
888,Miss
889,Mr
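
An alternative to the two chained `str.split` calls is a single regular expression with `str.extract`. This is only a sketch on a few names copied from the output above; the pattern assumes every name has the form `Surname, Title. Given names`.

```python
import pandas as pd

# A few names taken from the dataset preview above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture everything between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())
```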


In [30]:
df[['Title','Name']]

Unnamed: 0,Title,Name
0,Mr,"Braund, Mr. Owen Harris"
1,Mrs,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,Miss,"Heikkinen, Miss. Laina"
3,Mrs,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,Mr,"Allen, Mr. William Henry"
...,...,...
886,Rev,"Montvila, Rev. Juozas"
887,Miss,"Graham, Miss. Margaret Edith"
888,Miss,"Johnston, Miss. Catherine Helen ""Carrie"""
889,Mr,"Behr, Mr. Karl Howell"


In [32]:
(df[['Title','Survived']].groupby('Title').mean()['Survived']).sort_values(ascending=False)

Title,Survived
Lady,1.0
Ms,1.0
Sir,1.0
Mme,1.0
the Countess,1.0
Mlle,1.0
Mrs,0.792
Miss,0.697802
Master,0.575
Major,0.5


In [33]:
df['Is_Married'] = 0
# Assign in a single .loc call. The chained form df['Is_Married'].loc[...] = 1 sets values on an
# intermediate object and will not update df under Copy-on-Write (the default in pandas 3.0).
df.loc[df['Title'] == 'Mrs', 'Is_Married'] = 1

In [35]:
df['Is_Married']

Unnamed: 0,Is_Married
0,0
1,1
2,0
3,1
4,0
...,...
886,0
887,0
888,0
889,0
