In corporate settings, data scientists widely use the following methods to handle multicollinearity:

1. **Principal Component Analysis (PCA)**: PCA replaces correlated variables with a smaller set of uncorrelated components while retaining most of the variance in the data. It is easy to implement, although the resulting components are harder to interpret than the original variables.
2. **Lasso Regression**: Lasso adds an L1 penalty to the loss function, which shrinks coefficients and can set some of them exactly to zero. It handles high-dimensional data well, but among a group of highly correlated variables it tends to keep one and drop the rest, and which one it keeps can be somewhat arbitrary.
3. **Cross-validation**: Cross-validation estimates how a model performs on unseen data. It does not remove multicollinearity itself, but it is the standard way to tune regularization strength and to check whether a fix actually improves generalization.

The choice of method depends on the specific problem and dataset. Here's a general outline of when and how to use each method:

**Data Preprocessing**:

* **Data exploration**: Use correlation matrices, scatter plots, and other visualization techniques to understand the relationships between variables.
* **Feature selection**: Use techniques like recursive feature elimination, mutual information, or permutation importance to select the most relevant features.
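* **Data transformation**: Standardize features before PCA or penalized regression, and center variables before creating polynomial or interaction terms. Note that standardization alone does not change the correlation between two variables, and a log transformation only helps when it changes the shape of their relationship.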
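
A common way to quantify what the correlation matrix shows is the variance inflation factor (VIF). The sketch below is a minimal illustration on synthetic data: the column names and the helper `variance_inflation_factors` are made up for this example, and the VIFs are computed as the diagonal of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd


def variance_inflation_factors(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column (diagonal of the inverse correlation matrix)."""
    corr = features.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns, name="VIF")


# Synthetic data (an assumption for illustration): bedrooms is tied to size, age is not
rng = np.random.default_rng(0)
n = 200
size = rng.normal(1800, 400, n)
bedrooms = size / 500 + rng.normal(0, 0.3, n)
age = rng.normal(20, 8, n)
demo = pd.DataFrame({"size": size, "bedrooms": bedrooms, "age": age})

print(demo.corr().round(2))          # pairwise correlations
vif = variance_inflation_factors(demo)
print(vif.round(2))                  # large values flag collinear columns
```

A VIF near 1 means a column is nearly uncorrelated with the others; thresholds such as 5 or 10 are rules of thumb rather than hard limits.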

**Dimensionality Reduction**:

* **PCA**: Use PCA when the dataset has a large number of variables and you want to reduce dimensionality while retaining most of the information in the data.
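* **t-SNE**: Use t-SNE to visualize clusters and local structure in two or three dimensions. Its output is meant for exploration, not as model input, so it is not a remedy for multicollinearity.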
* **Autoencoders**: Use autoencoders when you want to reduce dimensionality and learn features from the data.

**Regularization Techniques**:

* **Lasso Regression**: Use lasso regression when you want to handle high-dimensional data and are concerned about multicollinearity.
* **Ridge Regression**: Use ridge regression when you want to reduce the impact of multicollinearity but still want to retain all the variables.
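* **Elastic Net**: Use elastic net, which combines the L1 and L2 penalties, when you want lasso-style variable selection but expect groups of correlated predictors that should be kept or dropped together.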
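
The difference between the three penalties is easiest to see on data with a near-duplicate predictor. The following is a rough sketch on synthetic data; the `alpha` and `l1_ratio` values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): x2 is almost a copy of x1
rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
y = 3 * x1 + 3 * x2 + 2 * x3 + rng.normal(scale=0.5, size=n)

# Penalized models are scale-sensitive, so standardize the features first
X = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))

models = {
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}
for name, coef in coefs.items():
    print(name, coef.round(2))
```

When comparing the printed coefficients, look at how each model divides the weight between the first two columns: ridge keeps both and tends to split it, while lasso is free to concentrate it on one of them.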

**Model Selection**:

* **Cross-validation**: Use cross-validation to evaluate the performance of a model on unseen data and prevent overfitting.
* **Grid search**: Use grid search to find the best hyperparameters for a model.
* **Bayesian model selection**: Use Bayesian model selection when you want to select the best model based on Bayesian principles.
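
Cross-validation and grid search are usually combined. As a sketch, assuming synthetic data with two nearly collinear predictors and an arbitrary grid of `alpha` values, the regularization strength of a ridge model could be tuned like this:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): columns 0 and 1 are nearly collinear
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + rng.normal(scale=0.1, size=n), rng.normal(size=n)])
y = 2 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + rng.normal(scale=1.0, size=n)

# Scaling lives inside the pipeline so it is re-fitted on each training fold
pipeline = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipeline,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into the preprocessing step.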

In terms of when to use each method, here's a general outline:

1. **Data preprocessing**: Perform data preprocessing before modeling to detect and handle multicollinearity.
2. **Dimensionality reduction**: Use dimensionality reduction techniques when the dataset has a large number of variables and you want to reduce dimensionality.
3. **Regularization techniques**: Use regularization techniques when you want to handle high-dimensional data and are concerned about multicollinearity.
4. **Model selection**: Use model selection techniques to choose the best model that handles multicollinearity.

Here's an example of how a data scientist might use these methods in practice:

1. **Load the dataset**: Load the dataset and perform data exploration to understand the relationships between variables.
2. **Preprocess the data**: Preprocess the data by selecting the most relevant features and transforming variables to reduce multicollinearity.
3. **Reduce dimensionality**: Use PCA to reduce dimensionality and retain most of the information in the data.
4. **Use regularization techniques**: Use lasso regression to handle high-dimensional data and reduce the impact of multicollinearity.
5. **Evaluate the model**: Use cross-validation to evaluate the performance of the model on unseen data and prevent overfitting.
6. **Select the best model**: Use grid search or Bayesian model selection to select the best model that handles multicollinearity.
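
The steps above can be sketched as a single scikit-learn pipeline. This is only an outline under stated assumptions: the housing-style data is synthetic, the variable names are invented, and the 95% variance threshold and `alpha=1.0` are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic housing-style data (an assumption): size is strongly tied to bedrooms
rng = np.random.default_rng(2)
n = 150
bedrooms = rng.integers(1, 6, size=n).astype(float)
size = 300 * bedrooms + rng.normal(scale=100, size=n)
age = rng.uniform(1, 20, size=n)
X = np.column_stack([size, bedrooms, age])
price = 100 * size + 10_000 * bedrooms - 500 * age + rng.normal(scale=10_000, size=n)

# Standardize, keep enough components for 95% of the variance, then fit a lasso
workflow = make_pipeline(StandardScaler(), PCA(n_components=0.95), Lasso(alpha=1.0))

# 5-fold cross-validated R^2 of the whole workflow
scores = cross_val_score(workflow, X, price, cv=5, scoring="r2")
print(scores.round(3), scores.mean().round(3))
```

Because the principal components are already uncorrelated, the lasso step here mainly acts as extra shrinkage; in practice PCA and penalized regression are often used as alternatives rather than together.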

By following this outline, data scientists can effectively handle multicollinearity in large datasets and build robust models that generalize well to new data.

In [1]:
import pandas as pd 
import numpy as np 

data = {
    'Price': [250000, 300000, 200000, 350000, 280000, 320000, 240000, 380000, 260000, 290000],
    'Size': [1500, 2000, 1200, 2500, 1800, 2200, 1400, 2800, 1600, 2000],
    'Bedrooms': [3, 4, 2, 5, 3, 4, 2, 5, 3, 4],
    'Bathrooms': [np.random.randint(1, 5) for _ in range(10)],
    'Age': [np.random.randint(1, 20) for _ in range(10)],
    'Distance_to_City': [np.random.randint(1, 10) for _ in range(10)],
    'Number_of_Floors': [np.random.randint(1, 5) for _ in range(10)],
    'Lot_Size': [np.random.randint(4000, 8000) for _ in range(10)]
}

# Create a pandas dataframe
df = pd.DataFrame(data)

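# Introduce high correlation between Price, Size, and Bedrooms.
# Note: np.random.randint without `size` returns a single scalar, so the same
# offset is added to every row. Size therefore becomes an exact linear function
# of Bedrooms; pass size=len(df) to draw independent per-row noise instead.
# Price is computed from the original Size values, before Size is overwritten.
df['Price'] = df['Size'] * 100 + df['Bedrooms'] * 10000 + np.random.randint(-10000, 10000)
df['Size'] = df['Bedrooms'] * 300 + np.random.randint(-500, 500)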

# Print the dataframe
df.head()

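|  | Price | Size | Bedrooms | Bathrooms | Age | Distance_to_City | Number_of_Floors | Lot_Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 176446 | 618 | 3 | 1 | 6 | 7 | 3 | 5991 |
| 1 | 236446 | 918 | 4 | 2 | 12 | 7 | 4 | 6389 |
| 2 | 136446 | 318 | 2 | 4 | 5 | 7 | 2 | 5232 |
| 3 | 296446 | 1218 | 5 | 2 | 5 | 4 | 4 | 6338 |
| 4 | 206446 | 618 | 3 | 2 | 4 | 6 | 3 | 6678 |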


---
### Principal Component Analysis

The parameters, methods, and attributes below refer to scikit-learn's `PCA` class (`sklearn.decomposition.PCA`):

**PCA Parameters**

1. **n_components**: The number of principal components to retain. It can be an integer or a float.
	* If it's an integer, it specifies the number of components to retain.
	* If it's a float between 0 and 1, it specifies the fraction of variance to retain, and PCA keeps the smallest number of components that reaches it.
2. **copy**: Whether to copy the input data. If `True` (the default), the data passed to `fit` is left untouched; if `False`, it may be overwritten, so use `fit_transform(X)` rather than `fit(X).transform(X)`.
3. **whiten**: If `True`, the transformed components are rescaled to unit variance. This discards their relative scales but can help downstream estimators that expect uncorrelated, equally scaled inputs. The default is `False`.
4. **svd_solver**: The solver used for the singular value decomposition (SVD). Options include:
	* `auto`: The solver is chosen based on the shape of the input data and `n_components`.
	* `full`: An exact full SVD is computed and the components are selected afterwards.
	* `arpack`: A truncated SVD is computed with the ARPACK solver.
	* `randomized`: An approximate truncated SVD is computed with a randomized algorithm.
5. **tol**: The tolerance for the singular values computed when `svd_solver='arpack'`.
6. **iterated_power**: The number of power-method iterations used when `svd_solver='randomized'`.
7. **random_state**: The random seed, used when `svd_solver` is `'arpack'` or `'randomized'`. Set it for reproducible results.
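
To make `n_components` and `whiten` concrete, here is a small sketch on the standardized iris data. The 0.95 threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Integer: keep exactly two components
pca_int = PCA(n_components=2).fit(X)

# Float: keep as many components as needed to explain at least 95% of the variance
pca_frac = PCA(n_components=0.95, svd_solver="full").fit(X)
print(pca_int.n_components_, pca_frac.n_components_)
print(pca_frac.explained_variance_ratio_.sum())

# whiten=True rescales each projected component to unit variance
X_white = PCA(n_components=2, whiten=True).fit_transform(X)
print(X_white.var(axis=0, ddof=1))
```

The fitted attribute `n_components_` reports how many components were actually kept, which is the value to inspect when a float was passed.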

**PCA Methods**

1. **fit**: This method fits the PCA model to the input data.
2. **transform**: This method transforms the input data into the new coordinate system.
3. **fit_transform**: This method fits the PCA model to the input data and transforms it into the new coordinate system.
4. **inverse_transform**: This method transforms the data back into the original coordinate system.
5. **get_covariance**: This method returns the data covariance matrix estimated from the fitted PCA model.
6. **get_precision**: This method returns the precision matrix (the inverse of the covariance matrix) estimated from the fitted PCA model.
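
As a brief sketch of how `fit_transform` and `inverse_transform` relate, the snippet below projects the iris data onto two components, maps it back, and measures what was lost. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Project onto two components, then map back to the original four features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
X_restored = pca.inverse_transform(X_reduced)

# The reconstruction is approximate because two components were discarded
reconstruction_mse = np.mean((X - X_restored) ** 2)
print(X_reduced.shape, X_restored.shape, reconstruction_mse)

# Keeping every component makes the round trip exact (up to floating point)
full = PCA(n_components=X.shape[1]).fit(X)
X_exact = full.inverse_transform(full.transform(X))
print(np.allclose(X_exact, X))
```

The reconstruction error is a practical check on whether the chosen number of components discards too much information.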

**PCA Attributes**

1. **components_**: The principal axes in feature space, one row per component, sorted by decreasing explained variance.
2. **explained_variance_ratio_**: The fraction of the total variance explained by each retained component.
3. **singular_values_**: The singular values corresponding to each retained component.
4. **mean_**: The per-feature mean estimated from the training data, which is subtracted before projection.
5. **n_features_**: The number of features in the training data. Recent scikit-learn releases expose this as `n_features_in_` instead.
6. **n_samples_**: The number of samples in the training data.
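
The relationships between these attributes can be checked directly. The following sketch fits PCA on the iris data and verifies a few of them; it is meant as an illustration of what the attributes contain, not as required usage.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)

print(pca.components_.shape)            # (n_components, n_features)
print(pca.explained_variance_ratio_)    # one fraction per retained component
print(pca.singular_values_)
print(pca.mean_)                        # per-feature mean of the training data

# The rows of components_ are orthonormal
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(2)))

# Explained variance follows from the singular values
variance_from_sv = pca.singular_values_ ** 2 / (X.shape[0] - 1)
print(np.allclose(variance_from_sv, pca.explained_variance_))
```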

**PCA Example**

Here's an example of using PCA in Python:
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the iris dataset (150 samples, 4 features)
iris = load_iris()
X = iris.data

# Create a PCA object with 2 components
pca = PCA(n_components=2)

# Fit the PCA model to the data
pca.fit(X)

# Transform the data into the new coordinate system
X_pca = pca.transform(X)

# Print the fraction of variance explained by each component
print(pca.explained_variance_ratio_)

# Print the principal components (one row per component)
print(pca.components_)

# Plot the data in the new coordinate system
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```
This example loads the iris dataset, creates a PCA object with 2 components, fits the PCA model to the data, transforms the data into the new coordinate system, and plots the data in the new coordinate system.

**PCA Advantages**

1. **Dimensionality reduction**: PCA can reduce the dimensionality of the input data, making it easier to visualize and analyze.
2. **Noise reduction**: PCA can reduce the noise in the input data, making it easier to identify patterns and relationships.
3. **Feature extraction**: PCA can extract the most important features from the input data, making it easier to identify the underlying structure of the data.
4. **Data visualization**: PCA can be used to visualize the input data in a lower-dimensional space, making it easier to understand the relationships between the variables.

**PCA Disadvantages**

1. **Assumes linearity**: PCA assumes that the relationships between the variables are linear, which may not always be the case.
2. **Sensitive to outliers**: PCA can be sensitive to outliers in the input data, which can affect the accuracy of the results.
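3. **Captures only variance and covariance**: PCA does not require normally distributed data, but it summarizes the data using second-order statistics only. For strongly non-Gaussian data, the directions of largest variance may miss the structure of interest.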
4. **Can be computationally expensive**: PCA can be computationally expensive, especially for large datasets.
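
The sensitivity to outliers is easy to demonstrate. In the sketch below, which uses synthetic two-dimensional data made up for this purpose, a single extreme point is enough to rotate the first principal component by roughly 90 degrees.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): almost all variance lies along the first axis
rng = np.random.default_rng(3)
clean = np.column_stack([rng.normal(size=200), 0.1 * rng.normal(size=200)])

# Append one extreme point far away along the second axis
with_outlier = np.vstack([clean, [0.0, 50.0]])

pc1_clean = PCA(n_components=1).fit(clean).components_[0]
pc1_outlier = PCA(n_components=1).fit(with_outlier).components_[0]

print(pc1_clean.round(3))     # dominated by the first axis
print(pc1_outlier.round(3))   # dominated by the second axis
```

Inspecting or clipping extreme values before fitting is one way to reduce this effect.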

**PCA Applications**

1. **Data visualization**: PCA can be used to visualize the input data in a lower-dimensional space, making it easier to understand the relationships between the variables.
2. **Feature extraction**: PCA can be used to extract the most important features from the input data, making it easier to identify the underlying structure of the data.
3. **Noise reduction**: PCA can be used to reduce the noise in the input data, making it easier to identify patterns and relationships.
4. **Dimensionality reduction**: PCA can be used to reduce the dimensionality of the input data, making it easier to analyze and visualize.
5. **Anomaly detection**: PCA can be used to detect anomalies in the input data, making it easier to identify unusual patterns and relationships.
6. **Clustering**: PCA is not a clustering algorithm, but projecting the data onto the leading components before running one (for example, k-means) can reduce noise and computation.
7. **Regression**: Regressing on principal components instead of the original variables (principal component regression) removes multicollinearity among the predictors, because the components are uncorrelated.

---

In [8]:
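from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Separate the predictors from the target
x = df.drop('Price', axis=1)

# Standardize so that every feature contributes on the same scale
x_scaled = StandardScaler().fit_transform(x)

# Project the standardized predictors onto the first two principal components
pca = PCA(n_components=2)
x_pca = pca.fit_transform(x_scaled)

# Variance captured by each retained component
print(pca.explained_variance_)
x_pca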

[3.55818964 1.54206507]


array([[-0.48894738, -1.68224541],
       [ 1.16868454, -0.7538723 ],
       [-2.26219909, -0.11600993],
       [ 1.81355444, -0.3249741 ],
       [-0.23621337, -0.52777293],
       [ 0.04671478,  2.84821574],
       [-3.5861117 , -0.42339797],
       [ 2.6442198 , -0.61727154],
       [-0.4709771 ,  1.0578455 ],
       [ 1.37127507,  0.53948293]])

---
Let's consider an illustrative example to show how multicollinearity can affect a regression model.

**Example:**

Suppose we are a marketing team for a company that sells cars, and we want to build a model to predict the price of a car based on its features. We collect data on the following variables:

* Price (response variable)
* Engine size (predictor variable)
* Horsepower (predictor variable)
* Number of cylinders (predictor variable)
* Weight (predictor variable)

We collect data on 100 cars and build a multiple linear regression model to predict the price of a car based on these features. The data looks like this:

| Car | Price | Engine Size | Horsepower | Number of Cylinders | Weight |
| --- | --- | --- | --- | --- | --- |
| 1 | $20,000 | 2.0L | 150HP | 4 | 1500kg |
| 2 | $25,000 | 2.5L | 200HP | 6 | 1700kg |
| 3 | $30,000 | 3.0L | 250HP | 8 | 2000kg |
|... |... |... |... |... |... |
| 100 | $40,000 | 4.0L | 350HP | 12 | 2500kg |

**Multicollinearity:**

Upon analyzing the data, we notice that the variables "Engine size", "Horsepower", and "Number of cylinders" are highly correlated with each other. For example:

* Cars with larger engines tend to have more horsepower and more cylinders.
* Cars with more horsepower tend to have larger engines and more cylinders.

This means that these variables are not independent of each other, and we have multicollinearity in our data.

**Effects of Multicollinearity:**

Because of multicollinearity, the model cannot cleanly separate the effect of one correlated predictor from another, so the estimated coefficients become unstable and their standard errors grow. For example:

* The coefficient for "Engine size" may be very large and positive, not because engine size alone drives the price, but because it absorbs part of the effect of horsepower and number of cylinders.
* The coefficient for "Horsepower" may be small or even negative, suggesting that more horsepower lowers the price. This is an artifact of the model offsetting the inflated "Engine size" coefficient, not a real relationship.

As a result, small changes in the data can produce very different coefficients, and the model may not accurately capture the relationships between the variables, even when its overall fit looks good.
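
This instability can be simulated. The sketch below uses synthetic "engine" and "horsepower" variables (made up for illustration, not the car data from the table) and compares how much the ordinary least squares coefficients vary across repeated samples when the two predictors are strongly correlated versus uncorrelated.

```python
import numpy as np


def coefficient_spread(rho, n_trials=200, n=100, seed=0):
    """Std of the two OLS slope estimates over repeated samples with predictor correlation rho."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_trials):
        engine = rng.normal(size=n)
        # horsepower has unit variance and correlation rho with engine
        horsepower = rho * engine + np.sqrt(1 - rho**2) * rng.normal(size=n)
        price = 2 * engine + 2 * horsepower + rng.normal(size=n)
        design = np.column_stack([np.ones(n), engine, horsepower])
        beta, *_ = np.linalg.lstsq(design, price, rcond=None)
        estimates.append(beta[1:])          # drop the intercept
    return np.std(estimates, axis=0)


spread_collinear = coefficient_spread(rho=0.99)
spread_uncorrelated = coefficient_spread(rho=0.0)
print("rho=0.99:", spread_collinear.round(3))
print("rho=0.00:", spread_uncorrelated.round(3))
```

The true coefficients are the same in both settings; only the correlation between the predictors changes, which isolates its effect on the variability of the estimates.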

**Consequences:**

The consequences of multicollinearity in this example are:

* Predictions can be unreliable for cars whose feature combinations differ from the training data, for example a small engine with unusually high horsepower.
* We may draw incorrect conclusions about the relationships between the variables, such as attributing higher prices to engine size alone when in fact it's the combination of larger engines, more horsepower, and more cylinders that's driving the price.
* The inflated variance of the coefficients makes the model more prone to overfitting the training data, so it may perform poorly on new, unseen data.

**Solutions:**

To address multicollinearity, we could:

* Remove one or more of the correlated variables from the model.
* Use dimensionality reduction techniques, such as principal component analysis (PCA), to reduce the number of variables.
* Use regularization techniques, such as Lasso or Ridge regression, to reduce the impact of multicollinearity.
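* Collect more data, which reduces the variance of the coefficient estimates. This only reduces the correlation itself if the new observations cover combinations of the variables that were previously missing.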
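
As a rough check of the first option, the sketch below compares the condition number of the predictor correlation matrix before and after dropping the redundant variables. The data is synthetic, and treating weight as unrelated to the engine variables is a simplifying assumption made for this example.

```python
import numpy as np

# Synthetic car-style predictors (an assumption for illustration)
rng = np.random.default_rng(4)
n = 500
engine = rng.normal(size=n)
horsepower = engine + rng.normal(scale=0.1, size=n)   # nearly a copy of engine
cylinders = engine + rng.normal(scale=0.1, size=n)    # nearly a copy of engine
weight = rng.normal(size=n)


def condition_number(columns):
    """Condition number of the correlation matrix of the given columns."""
    corr = np.corrcoef(np.column_stack(columns), rowvar=False)
    return np.linalg.cond(corr)


cond_all = condition_number([engine, horsepower, cylinders, weight])
cond_reduced = condition_number([engine, weight])
print(round(cond_all, 1), round(cond_reduced, 2))
```

A large condition number signals that the predictors are close to linearly dependent; after dropping the redundant columns it should fall close to 1.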

By addressing multicollinearity, we can build a more robust and accurate model that provides valuable insights into the relationships between the variables.
