# Interview Questions


#### 1. **How does the choice of regularization type (L1 vs L2) in Logistic Regression impact feature selection and model stability, especially in datasets with correlated features?**
   - A) L1 regularization performs better for highly correlated features by selecting all of them, while L2 regularization removes irrelevant features.
   - B) L1 regularization encourages sparsity and can eliminate some correlated features, whereas L2 regularization retains all features by shrinking their coefficients but may struggle with feature selection.
   - C) L1 regularization leads to better stability by assigning similar coefficients to correlated features, while L2 regularization results in a sparse model.
   - D) L1 regularization should only be used when there are no correlated features, while L2 regularization is optimal for highly correlated datasets.

   <br>
<br>
<br>
<br>
<br>
<br>
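   **Correct Answer:** B) L1 regularization encourages sparsity and can eliminate some correlated features, whereas L2 regularization retains all features by shrinking their coefficients but may struggle with feature selection.

   **Explanation:** L1 regularization (a Lasso-style penalty) can produce sparse solutions by driving some coefficients exactly to zero, effectively performing feature selection. Among a group of correlated features it tends to keep one and zero out the rest, and which one survives can change from sample to sample, so the selected set is less stable. L2 regularization (a Ridge-style penalty) shrinks all coefficients towards zero but does not remove any; it spreads weight across correlated features, which helps mitigate multicollinearity but performs no feature selection.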
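
The sparsity difference can be checked with a small experiment. The sketch below assumes synthetic data from `make_classification` (with some redundant, correlated features) and an arbitrary regularization strength `C=0.05`; neither comes from the original material, and the exact counts depend on both choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 20 features, of which
# 5 are informative and 5 are linear combinations of those (correlated).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
X = StandardScaler().fit_transform(X)

# Same regularization strength for both penalties; only the penalty type differs.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

# Count coefficients that were driven exactly to zero.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("zero coefficients with L1:", n_zero_l1)
print("zero coefficients with L2:", n_zero_l2)
```

Under these assumptions, the L1 model should zero out several coefficients while the L2 model keeps all of them nonzero.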

#### 2. **How does the pruning of a Decision Tree influence both bias and variance, and what is the tradeoff associated with pruning when the training data contains noise?**
   - A) Pruning increases bias but reduces variance, leading to better generalization on noisy datasets by preventing the tree from fitting the noise.
   - B) Pruning decreases both bias and variance, improving model accuracy on noisy datasets.
   - C) Pruning increases variance but decreases bias, resulting in overfitting on noisy datasets.
   - D) Pruning does not affect bias but only reduces variance, making it suitable for perfectly clean datasets.

   <br>
<br>
<br>
<br>
<br>
<br>
   **Correct Answer:** A) Pruning increases bias but reduces variance, leading to better generalization on noisy datasets by preventing the tree from fitting the noise.

   **Explanation:** Pruning removes branches from the tree, which simplifies the model and increases bias. However, it reduces variance by preventing the tree from overfitting the noise in the training data, which improves generalization, especially in noisy datasets.
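
One way to see this tradeoff is to compare an unpruned tree with a pruned one on data with label noise. The sketch below uses scikit-learn's cost-complexity pruning (`ccp_alpha`); the synthetic dataset, the 20% label-noise rate, and the value `ccp_alpha=0.01` are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 randomly reassigns about 20% of the labels, i.e. injects label noise.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Unpruned tree: grows until every training point is classified correctly.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Pruned tree: cost-complexity pruning removes branches that add little purity.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(f"{name}: leaves={tree.get_n_leaves()}, "
          f"train acc={tree.score(X_tr, y_tr):.3f}, "
          f"test acc={tree.score(X_te, y_te):.3f}")
```

The pruned tree has fewer leaves and a lower training accuracy (higher bias); whether its test accuracy is higher depends on the data and on `ccp_alpha`, which would normally be chosen by cross-validation.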

#### 3. **In Random Forest, how does the choice of the number of features to consider at each split interact with the number of trees, and what impact does this have on bias and variance?**
   - A) Choosing a higher number of features decreases bias while increasing variance; increasing the number of trees helps mitigate the variance.
   - B) A smaller number of features per split increases both bias and variance, and increasing the number of trees cannot balance this tradeoff.
   - C) Using fewer features per split increases variance and decreases bias, while a larger number of trees further increases variance.
   - D) Choosing fewer features per split increases bias while reducing variance, but increasing the number of trees helps lower bias without affecting variance.

   <br>
<br>
<br>
<br>
<br>
<br>
   **Correct Answer:** A) Choosing a higher number of features decreases bias while increasing variance; increasing the number of trees helps mitigate the variance.

   **Explanation:** When more features are considered at each split, individual trees can capture more complex patterns, which reduces bias, but the trees also become more alike (more correlated), so the ensemble's variance is higher. Increasing the number of trees averages out part of that variance without increasing bias. Conversely, considering fewer features per split decorrelates the trees, trading slightly higher bias for lower variance.
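
The role of the number of features per split can be made visible by measuring how similar the individual trees' predictions are: averaging helps more when trees disagree with one another. The sketch below is an illustration on synthetic regression data; the helper name `mean_tree_correlation` and all parameter values are assumptions, not part of the original material.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)
X_train, y_train = X[:300], y[:300]
X_test = X[300:]

def mean_tree_correlation(max_features):
    """Average pairwise correlation between the predictions of the forest's trees."""
    forest = RandomForestRegressor(n_estimators=50, max_features=max_features,
                                   random_state=0).fit(X_train, y_train)
    # One row of test-set predictions per tree: shape (n_trees, n_test_samples).
    preds = np.array([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(preds)
    # Drop the diagonal (each tree's correlation with itself).
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
    return off_diagonal.mean()

corr_few = mean_tree_correlation(2)     # 2 candidate features per split
corr_all = mean_tree_correlation(None)  # all 10 features per split
print(f"mean tree correlation, max_features=2:    {corr_few:.3f}")
print(f"mean tree correlation, max_features=None: {corr_all:.3f}")
```

Trees built with fewer candidate features per split should be less correlated, which is why averaging them reduces variance more effectively.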

#### 4. **How does Gradient Boosting differ from Random Forest in handling outliers, and why might one approach be more suitable than the other in datasets with significant noise?**
   - A) Gradient Boosting can be more sensitive to outliers as it iteratively fits residuals, potentially overfitting to the noise, while Random Forest is more robust due to averaging multiple trees.
   - B) Gradient Boosting automatically removes outliers through residual fitting, while Random Forest fits the noise by averaging over outliers.
   - C) Both Gradient Boosting and Random Forest handle outliers similarly, but Gradient Boosting converges faster due to residual correction.
   - D) Random Forest is more sensitive to outliers because it uses unpruned trees, while Gradient Boosting is less affected as it uses residuals.

   <br>
<br>
<br>
<br>
<br>
<br>
   **Correct Answer:** A) Gradient Boosting can be more sensitive to outliers as it iteratively fits residuals, potentially overfitting to the noise, while Random Forest is more robust due to averaging multiple trees.

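   **Explanation:** Gradient Boosting fits each new tree to the residuals (more generally, the loss gradients) of the current ensemble. With a squared-error loss, outliers produce large residuals, so later trees keep chasing them and the model can overfit noisy data. In contrast, Random Forest dampens the impact of outliers by averaging the predictions of many decorrelated trees, each trained on a different bootstrap sample.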

#### 5. **How does the precision-recall curve behave for a highly imbalanced dataset, and why might the F1-score be a better metric than accuracy for evaluating classifier performance in this scenario?**
   - A) The precision-recall curve will show high precision and low recall, making accuracy a reliable metric; the F1-score is less effective due to the imbalance.
   - B) The precision-recall curve will display significant tradeoffs, with precision dropping sharply as recall increases, indicating that the F1-score better captures the balance between precision and recall than accuracy.
   - C) The precision-recall curve does not change for imbalanced datasets, and accuracy remains the best metric for evaluating performance.
   - D) The precision-recall curve becomes meaningless in imbalanced datasets, so neither F1-score nor accuracy is suitable for evaluation.

   <br>
<br>
<br>
<br>
<br>
<br>
   **Correct Answer:** B) The precision-recall curve will display significant tradeoffs, with precision dropping sharply as recall increases, indicating that the F1-score better captures the balance between precision and recall than accuracy.

   **Explanation:** In highly imbalanced datasets, accuracy can be misleading because a model that always predicts the majority class still appears highly accurate. The precision-recall curve focuses on the minority (positive) class and exposes the tradeoff between precision and recall, and the F1-score, the harmonic mean of the two, gives a more balanced evaluation.
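
A small worked example makes the point concrete. The labels and predictions below are made up for illustration: 5 positives among 100 samples, one classifier that always predicts the majority class, and one that finds most positives at the cost of some false alarms.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 5 positives followed by 95 negatives.
y_true = np.array([1] * 5 + [0] * 95)

# Classifier A: always predicts the majority class (0).
y_majority = np.zeros(100, dtype=int)

# Classifier B: catches 4 of the 5 positives but raises 6 false alarms.
y_detector = np.zeros(100, dtype=int)
y_detector[:4] = 1     # 4 true positives (the 5th positive is missed)
y_detector[5:11] = 1   # 6 false positives

acc_majority = accuracy_score(y_true, y_majority)
f1_majority = f1_score(y_true, y_majority, zero_division=0)
acc_detector = accuracy_score(y_true, y_detector)
f1_detector = f1_score(y_true, y_detector)

print(f"majority: accuracy={acc_majority:.2f}, F1={f1_majority:.2f}")
print(f"detector: accuracy={acc_detector:.2f}, F1={f1_detector:.2f}")
```

The majority-class predictor scores 0.95 accuracy with an F1 of 0, while the detector scores a lower accuracy (0.93) but an F1 of about 0.53, so accuracy ranks the two classifiers in the opposite order from F1.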


#### 6. **How do the regularization parameters $\lambda$ in Linear Regression and $C$ in Logistic Regression control the strength of regularization, and what impact do they have on the model's coefficients?**
   - A) Increasing $\lambda$ in Linear Regression decreases the regularization strength, while increasing $C$ in Logistic Regression increases regularization strength, leading to larger coefficients.
   - B) A higher $\lambda$ value in Linear Regression increases regularization strength, shrinking coefficients towards zero, while a smaller $C$ value in Logistic Regression also increases regularization strength, reducing coefficients.
   - C) Decreasing $\lambda$ in Linear Regression increases regularization strength, while decreasing $C$ in Logistic Regression has no effect on the model's coefficients.
   - D) Increasing $\lambda$ in Linear Regression removes regularization completely, while increasing $C$ in Logistic Regression causes coefficients to become sparse.

   <br>
<br>
<br>
<br>
<br>
<br>
   **Correct Answer:** B) A higher $\lambda$ value in Linear Regression increases regularization strength, shrinking coefficients towards zero, while a smaller $C$ value in Logistic Regression also increases regularization strength, reducing coefficients.

   **Explanation:** In regularized Linear Regression (Ridge or Lasso), $\lambda$ controls the regularization strength directly: a higher $\lambda$ increases the penalty on large coefficients, shrinking them toward zero. In Logistic Regression, $C$ is the inverse of the regularization strength: a smaller $C$ increases the regularization, similarly shrinking the coefficients. Thus, a higher $\lambda$ and a smaller $C$ both strengthen regularization.
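
The opposite directions of the two parameters can be checked by tracking the size of the fitted coefficient vector. The sketch below assumes scikit-learn, where the Ridge penalty strength is named `alpha` (playing the role of lambda) and logistic regression exposes `C`; the synthetic datasets and the specific values tried are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge

X_reg, y_reg = make_regression(n_samples=200, n_features=10, noise=5.0,
                               random_state=0)
X_clf, y_clf = make_classification(n_samples=200, n_features=10, random_state=0)

# Ridge: larger alpha (lambda) means stronger regularization.
ridge_norms = {
    alpha: np.linalg.norm(Ridge(alpha=alpha).fit(X_reg, y_reg).coef_)
    for alpha in (0.01, 1.0, 100.0)
}

# Logistic regression: smaller C means stronger regularization (L2 by default).
logreg_norms = {
    C: np.linalg.norm(LogisticRegression(C=C, max_iter=1000).fit(X_clf, y_clf).coef_)
    for C in (100.0, 1.0, 0.01)
}

for alpha, norm in ridge_norms.items():
    print(f"Ridge alpha={alpha:>6}: ||coef|| = {norm:.3f}")
for C, norm in logreg_norms.items():
    print(f"LogReg C={C:>6}: ||coef|| = {norm:.3f}")
```

The coefficient norm shrinks as `alpha` grows and as `C` shrinks, which is the same effect reached from opposite ends of the scale.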

#### 7. **How does the interpretation of $\lambda$ in Linear Regression differ from $C$ in Logistic Regression, and what is the effect on bias and variance when adjusting these parameters?**
   - A) Increasing $\lambda$ in Linear Regression reduces bias and increases variance, while increasing $C$ in Logistic Regression reduces variance and increases bias.
   - B) Increasing $\lambda$ in Linear Regression increases bias and decreases variance, while decreasing $C$ in Logistic Regression increases regularization, leading to higher bias and lower variance.
   - C) In both cases, increasing $\lambda$ or $C$ reduces bias and variance simultaneously, leading to improved generalization.
   - D) Decreasing $\lambda$ in Linear Regression and increasing $C$ in Logistic Regression both increase regularization strength, leading to higher bias and lower variance.

   <br>
<br>
<br>
<br>
<br>
<br>
   **Correct Answer:** B) Increasing $\lambda$ in Linear Regression increases bias and decreases variance, while decreasing $C$ in Logistic Regression increases regularization, leading to higher bias and lower variance.

   **Explanation:** In Linear Regression, increasing $\lambda$ strengthens regularization, which shrinks coefficients, increases bias, and reduces variance. In Logistic Regression, $C$ is the inverse of regularization strength, so decreasing $C$ increases regularization, which also leads to higher bias and lower variance. Thus, adjusting $\lambda$ and $C$ affects the bias-variance tradeoff in similar ways.
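
The inverse relationship is easiest to see by writing the two objectives side by side. The forms below follow the convention commonly used by scikit-learn (an assumption here, since the original does not state a formulation), with the intercept omitted and labels coded as -1 and +1 for the logistic loss.

```latex
\begin{aligned}
\text{Ridge regression:}\quad
  & \min_{w}\; \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2
    + \lambda \lVert w \rVert_2^2 \\
\text{L2 logistic regression:}\quad
  & \min_{w}\; \tfrac{1}{2} \lVert w \rVert_2^2
    + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, w^\top x_i}\right),
    \qquad y_i \in \{-1, +1\}
\end{aligned}
```

In the first objective the multiplier sits on the penalty, in the second it sits on the data-fit term. Dividing the logistic objective by $C$ gives a penalty weight of $1/(2C)$, so $C$ acts like $1/\lambda$ up to a constant factor.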

# SVM Code

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import feature_extraction, model_selection, naive_bayes, metrics, svm
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

!gdown 1QViUZJ5UIBCgxB_qbOXTLs_2V48w7MWo

df = pd.read_csv('Spam_processed.csv', encoding='latin-1')
df.dropna(inplace = True)

display(df)


| Unnamed: 0 | type | message | cleaned_message |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah nt think goes usf lives around though |
| ... | ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... | 2nd time tried 2 contact u u å750 pound prize ... |
| 5568 | 0 | Will Ì_ b going to esplanade fr home? | ì_ b going esplanade fr home |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... | pity mood suggestions |
| 5570 | 0 | The guy did some bitching but I acted like i'd... | guy bitching acted like interested buying some... |


In [None]:
from sklearn.model_selection import train_test_split

df_X_train, df_X_test, y_train, y_test = train_test_split(df['cleaned_message'], df['type'],
                                                          test_size=0.25, random_state=47)
print([np.shape(df_X_train), np.shape(df_X_test)])

# CountVectorizer
f = feature_extraction.text.CountVectorizer()
X_train = f.fit_transform(df_X_train)
X_test = f.transform(df_X_test)

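# StandardScaler: with_mean=False because centering would turn the sparse count
# matrix into a dense one (StandardScaler raises an error on sparse input otherwise)
scaler = StandardScaler(with_mean=False)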
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print([np.shape(X_train), np.shape(X_test)])
print(type(X_train))

[(4173,), (1392,)]
[(4173, 7622), (1392, 7622)]
<class 'scipy.sparse._csr.csr_matrix'>
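
The reason for `with_mean=False` can be reproduced on a tiny example. The sketch below uses a made-up four-message corpus (not the spam dataset) to show that `StandardScaler` refuses to center a sparse matrix, while variance-only scaling keeps the matrix sparse.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical mini-corpus, for illustration only.
docs = ["free entry win prize", "ok see you later",
        "win free tickets now", "call me later"]
X = CountVectorizer().fit_transform(docs)  # sparse matrix of word counts

# Centering would make every zero entry nonzero, so scikit-learn refuses.
try:
    StandardScaler(with_mean=True).fit(X)
    centering_failed = False
except ValueError as err:
    centering_failed = True
    print("with_mean=True failed:", err)

# Scaling by the standard deviation alone preserves sparsity.
X_scaled = StandardScaler(with_mean=False).fit_transform(X)
print(type(X_scaled), X_scaled.shape)
```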


In [None]:
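# SVC

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Candidate values for C (inverse regularization strength)
params = {
          'C': [1e-4, 0.001, 0.01, 0.1, 1, 10] # which hyperparam value of C do you think will work well?
         }

# class_weight penalizes errors on spam (1) five times as heavily as errors on ham (0)
svc = SVC(class_weight={0: 0.1, 1: 0.5}, kernel='linear')
# 3-fold cross-validated grid search, selecting C by F1-score on the spam class
clf = GridSearchCV(svc, params, scoring="f1", cv=3)

clf.fit(X_train, y_train)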

In [None]:
res = clf.cv_results_

for i in range(len(res["params"])):
  print(f"Parameters:{res['params'][i]} \n Mean score: {res['mean_test_score'][i]} \n Rank: {res['rank_test_score'][i]}")

Parameters:{'C': 0.0001} 
 Mean score: 0.6566305780023073 
 Rank: 6
Parameters:{'C': 0.001} 
 Mean score: 0.7742322485787693 
 Rank: 1
Parameters:{'C': 0.01} 
 Mean score: 0.767533370474547 
 Rank: 2
Parameters:{'C': 0.1} 
 Mean score: 0.7649416969151316 
 Rank: 3
Parameters:{'C': 1} 
 Mean score: 0.7649416969151316 
 Rank: 3
Parameters:{'C': 10} 
 Mean score: 0.7649416969151316 
 Rank: 3
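
In the cells above, the vectorizer and scaler are fitted once on the whole training split before the grid search, so each cross-validation fold sees vocabulary and scaling statistics computed from its own validation part. One possible alternative is to wrap the steps in a `Pipeline`, which refits them inside every fold. The sketch below shows the idea on a hypothetical toy corpus, not on `Spam_processed.csv`; the step names and the reduced grid are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for cleaned_message (1 = spam, 0 = ham).
messages = [
    "free entry win prize now", "win cash prize call now",
    "free tickets claim prize", "urgent claim free cash",
    "call now win free entry", "free prize urgent call",
    "ok see you later", "going home now", "nah think he lives around",
    "say early already", "see you at home later", "ok going there later",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizing and scaling happen inside the pipeline, so they are refitted
# on the training part of each cross-validation fold.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("scaler", StandardScaler(with_mean=False)),
    ("svc", SVC(kernel="linear", class_weight={0: 0.1, 1: 0.5})),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"svc__C": [0.001, 0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(messages, labels)

print("best params:", search.best_params_)
print("best mean F1:", search.best_score_)
print("prediction for a new message:", search.predict(["win free prize"]))
```

On real data, the fitted `search` object can be scored on the held-out test split in the same way, since the pipeline applies the vectorizer and scaler to raw text automatically.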
