
## Problem 1: Python programming, data processing



In this problem we want to generate pseudo-random data that has certain desired statistical properties. This can be useful for demo, research or testing purposes.

First, let’s generate these “desired statistical properties”.

 - Generate a random 6x6 correlation matrix rho.

 - **Regularization**: write a test checking that rho is a valid correlation matrix, and if not - find the nearest valid one.

Now, let’s generate the data that would have these properties.

 - Generate a set of 6 random variables (put them in a matrix 1000x6, the distribution doesn’t matter, but should be continuous), distributed between 0 and 1 with correlation defined by rho.

The Python code below performs the following tasks:

1) Generates a random 6x6 matrix using NumPy.
2) Creates a correlation matrix from the random matrix using the np.corrcoef function.
3) Defines a function is_pos_def to check whether a matrix is positive definite.
4) Checks whether the correlation matrix is valid using the is_pos_def function.
5) If the correlation matrix is not valid, finds a nearby valid one by adding a small value to the diagonal and prints it. Otherwise, prints the generated matrix as a valid correlation matrix.

The code uses NumPy for the matrix operations and the correlation coefficient calculation. It also uses the np.eye function to create an identity matrix and the np.linalg.eigvals function to check the matrix for positive definiteness.

In [6]:
import numpy as np

# Generate a random 6x6 matrix
random_matrix = np.random.rand(6, 6)

# Create a correlation matrix from the random matrix
rho = np.corrcoef(random_matrix, rowvar=False)

# Check if the correlation matrix is valid
def is_pos_def(matrix):
    return np.all(np.linalg.eigvals(matrix) > 0)

# If the correlation matrix is not valid, add a small value to its diagonal
# and rescale so that the diagonal is exactly 1 again
if not is_pos_def(rho):
    shifted = rho + np.eye(6) * 0.01
    scale = np.sqrt(np.diag(shifted))
    nearest_corr_matrix = shifted / np.outer(scale, scale)
    print("The nearest valid correlation matrix is: \n", nearest_corr_matrix)
else:
    print("The generated matrix is a valid correlation matrix: \n", rho)

The generated matrix is a valid correlation matrix: 
 [[ 1.         -0.05938082  0.09941322 -0.14773491 -0.7872452   0.41425299]
 [-0.05938082  1.          0.33169573  0.70918342  0.24296406  0.6341125 ]
 [ 0.09941322  0.33169573  1.          0.85562388  0.42443298 -0.09257481]
 [-0.14773491  0.70918342  0.85562388  1.          0.51485773  0.20146345]
 [-0.7872452   0.24296406  0.42443298  0.51485773  1.         -0.43635627]
 [ 0.41425299  0.6341125  -0.09257481  0.20146345 -0.43635627  1.        ]]
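
With only six rows of data, np.corrcoef returns a matrix whose rank is at most five, so its smallest eigenvalue is zero up to rounding and a strict "> 0" test can pass or fail by chance. A minimal sketch of a fuller check and repair follows, assuming a tolerance-based test (symmetry, unit diagonal, non-negative eigenvalues) and eigenvalue clipping as the repair. The names is_valid_corr and repair_corr and the tolerances are illustrative, and clipping gives a nearby valid matrix rather than the provably nearest one.

```python
import numpy as np

def is_valid_corr(m, tol=1e-8):
    """Return True if m is square, symmetric, has a unit diagonal, and is PSD."""
    m = np.asarray(m, dtype=float)
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        return False
    symmetric = np.allclose(m, m.T, atol=tol)
    unit_diag = np.allclose(np.diag(m), 1.0, atol=tol)
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues
    psd = np.linalg.eigvalsh((m + m.T) / 2).min() >= -tol
    return bool(symmetric and unit_diag and psd)

def repair_corr(m, eps=1e-6):
    """Clip eigenvalues below eps, then rescale back to a unit diagonal."""
    m = np.asarray(m, dtype=float)
    vals, vecs = np.linalg.eigh((m + m.T) / 2)
    fixed = (vecs * np.clip(vals, eps, None)) @ vecs.T
    d = np.sqrt(np.diag(fixed))
    fixed = fixed / np.outer(d, d)
    np.fill_diagonal(fixed, 1.0)
    return (fixed + fixed.T) / 2

# An invalid "correlation" matrix: the three pairwise values contradict each other
bad = np.array([[1.0, 0.9, -0.9],
                [0.9, 1.0, 0.9],
                [-0.9, 0.9, 1.0]])
fixed = repair_corr(bad)
print(is_valid_corr(bad), is_valid_corr(fixed))
```

If the exact nearest correlation matrix is required, an iterative projection method would be needed instead of a single clipping step.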


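The data-generation step from the problem statement (a 1000x6 matrix of continuous variables between 0 and 1 with correlation given by rho) can be sketched with a Gaussian copula. This is one possible approach under stated assumptions, not the notebook's method: the rho built here is a stand-in, and the correlations of the resulting uniform variables only approximate rho.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Build a valid 6x6 correlation matrix for illustration (stand-in for rho)
a = rng.normal(size=(6, 6))
cov = a @ a.T + 1e-6 * np.eye(6)
d = np.sqrt(np.diag(cov))
rho = cov / np.outer(d, d)

# Correlated standard normals: rows of e @ L.T have covariance L @ L.T = rho
chol = np.linalg.cholesky(rho)
z = rng.standard_normal((1000, 6)) @ chol.T

# The standard normal CDF maps each column to a continuous variable in (0, 1)
norm_cdf = np.vectorize(lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))))
x = norm_cdf(z)

# Largest gap between the sample correlation of x and the target rho
max_gap = np.abs(np.corrcoef(x, rowvar=False) - rho).max()
print(x.shape, float(x.min()), float(x.max()), float(max_gap))
```

The gap comes from sampling noise and from the CDF transform, which slightly shrinks Pearson correlations relative to the underlying normals.
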
#### Slightly different, but related problem.

 - Apply PCA to reduce the dimensionality to 5.
 - Let the output variable y = round(x6).
 - Build a couple of classifiers of your choice to predict y from {x1, x2, …, x5}.
 - Compare their performance.

In [33]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVC, NuSVR
from sklearn.metrics import accuracy_score, classification_report, r2_score
import warnings

warnings.filterwarnings('ignore')

# Generate a random 6x6 matrix
random_matrix = np.random.rand(6, 6)

# Apply PCA to reduce the dimensionality to 5
pca = PCA(n_components=5)
pca_matrix = pca.fit_transform(random_matrix)

# Create a new variable y by rounding the 6th column of the random matrix
y = np.round(random_matrix[:, 5])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(pca_matrix, y, test_size=0.3, random_state=42)

# Train and evaluate the models (these are regressors, so they are scored with R^2)
classifiers = [
    ('Linear Regression', LinearRegression()),
    ('AdaBoost', AdaBoostRegressor()),
    ('Random Forest', RandomForestRegressor()),
    ('Support Vector Machine', NuSVR(kernel='rbf'))
]

for name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{name} R^2: {r2_score(y_test, y_pred)}")

Linear Regression R^2: -3.1312149669052722
AdaBoost R^2: -1.0
Random Forest R^2: 0.22619999999999996
Support Vector Machine R^2: 0.22497953924701997
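
Because y = round(x6) is a binary label, the comparison can also be framed as classification and scored with accuracy. The sketch below assumes independent uniform data as a stand-in for the 1000x6 matrix and, like the cell above, applies PCA to all six columns, so the components still carry information about x6. The choice of models is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 1000x6 data set
data = rng.random((1000, 6))
y = np.round(data[:, 5]).astype(int)  # binary label from the 6th column

X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size=0.3, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    # PCA is fitted on the training split only, inside the pipeline
    clf = make_pipeline(PCA(n_components=5), model)
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name} accuracy: {scores[name]:.3f}")
```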


## Problem 2: Data Science, Model Build

**Dataset used** - 10_01_train_dataset.csv

**Key Assumptions**

Categorical variables are identified as object-type variables, plus numerical variables that have 6 or fewer unique values.
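
The rule above can be sketched on a toy frame. The column names and values are hypothetical, and the threshold of 6 is taken from the code that follows.

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only
df = pd.DataFrame({
    "city": ["a", "b", "a", "c", "b", "a", "c", "b"],
    "rating": [1, 2, 3, 1, 2, 3, 1, 2],
    "income": [10.5, 20.1, 30.2, 40.3, 50.4, 60.5, 70.6, 80.7],
})

max_levels = 6
categorical, continuous = [], []
for col in df.columns:
    is_numeric = pd.api.types.is_numeric_dtype(df[col])
    # Numeric columns with more than max_levels distinct values count as continuous
    if is_numeric and df[col].nunique() > max_levels:
        continuous.append(col)
    else:
        categorical.append(col)

print(categorical, continuous)
```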

**Dataset Description**

You have been provided with a dataset that has 116 Rows, 123 Columns (mix of continuous and categorical variables)
and a target column.

The goal is to build a model that generalizes well over this dataset, you are free to transform the dataset as you
feel necessary. We are not looking for the highest scoring model. Our goal is to understand your thought process and
decision making.    
    

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVC, NuSVR
from sklearn.metrics import accuracy_score, classification_report, r2_score

# Load the dataset
df = pd.read_csv('10_01_train_dataset.csv')

# Split the data into features and target
X = df.drop('target', axis=1)
## Logarithmically transforming the target due to high variance
y = np.log(df['target'])

# Data Exploration and Preprocessing
# Check for missing values
print(X.isnull().sum())


# Split numeric columns: more than 6 distinct values counts as continuous, otherwise categorical
num_cols = []
cat_cols = []
for col in X.columns:
    if X[col].dtype in ['int64', 'float64']:  # Check if the column is numeric
        if df[col].nunique() > 6:  # Check if the number of distinct values is greater than 6
            num_cols.append(col)
        else:
            cat_cols.append(col)


# Encode categorical variables

categorical_cols = list(X.select_dtypes(include=['object']).columns.values)

discrete_cols = categorical_cols + cat_cols

# print(discrete_cols)


for col in discrete_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])

# Scale the continuous variables

scaler = MinMaxScaler()
X[num_cols] = scaler.fit_transform(X[num_cols])

# Feature Engineering
# Perform dimensionality reduction using PCA
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)

# Perform feature selection using chi-squared test
# kbest = SelectKBest(score_func=chi2, k=10)
# X_kbest = kbest.fit_transform(X, y)

# Model Building
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.3, random_state=42)

# Build and evaluate a baseline model
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(f"Linear Regression R^2: {r2_score(y_test, y_pred)}")

# Experiment with different algorithms
classifiers = [
    ('AdaBoost', AdaBoostRegressor()),
    ('Random Forest', RandomForestRegressor()),
    ('Support Vector Machine', NuSVR(kernel='rbf'))
]

for name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{name} R^2: {r2_score(y_test, y_pred)}")
    # print(f"{name} classification report: \n{classification_report(y_test, y_pred)}")


In [29]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVC, NuSVR
from sklearn.metrics import accuracy_score, classification_report, r2_score

# Load the dataset
df = pd.read_csv('10_01_train_dataset.csv')

# Split the data into features and target
X = df.drop('target', axis=1)
## Logarithmically transforming the target due to high variance
y = np.log(df['target'])

# Data Exploration and Preprocessing
# Check for missing values
# print(X.isnull().sum())

# Split numeric columns into continuous and low-cardinality (categorical) groups
num_cols = []
cat_cols = []
for col in X.columns:
    if X[col].dtype in ['int64', 'float64']:  # Check if the column is numeric
        if df[col].nunique() > 6:  # Check if the number of distinct values is greater than 6
            num_cols.append(col)
        else:
            cat_cols.append(col)


# Encode categorical variables

categorical_cols = list(X.select_dtypes(include=['object']).columns.values)

discrete_cols = categorical_cols + cat_cols



for col in discrete_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])

# Scale the continuous variables
scaler = MinMaxScaler()
X[num_cols] = scaler.fit_transform(X[num_cols])

# Feature Engineering
# Perform dimensionality reduction using PCA
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X[num_cols])

# Perform feature selection using Lasso method
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
lasso_coef = lasso.coef_
lasso_cols = X.columns[lasso_coef != 0] #['fast_food_restaurant_0_25', 'restaurant_0_25']


# Model Building
# Split the data into training and testing sets, using the continuous and discrete columns
X_train, X_test, y_train, y_test = train_test_split(X[num_cols + discrete_cols], y, test_size=0.3, random_state=42)

# Build and evaluate a baseline model
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(f"Linear Regression R^2: {r2_score(y_test, y_pred)}")

# Experiment with different algorithms (all regressors, scored with R^2)
classifiers = [
    ('Adaptive Boosting', AdaBoostRegressor()),
    ('Random Forest', RandomForestRegressor()),
    ('Support Vector Machine', NuSVR(kernel='rbf'))
]

for name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"{name} R^2: {r2_score(y_test, y_pred)}")

Linear Regression R^2: -20.17779049924262
Adaptive Boosting R^2: 0.08648863503492699
Random Forest R^2: 0.10479318626963885
Support Vector Machine R^2: 0.04379590760421792
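
With only 116 rows, a single 70/30 split gives a noisy estimate, and fitting the scaler or PCA on the full data set before splitting lets test rows influence the preprocessing. A hedged sketch of one alternative is to wrap the preprocessing and the model in a pipeline and score it with k-fold cross-validation. The data below is synthetic and the model choices are illustrative, so the printed scores say nothing about the real data set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: few rows, many columns, target driven by two features
X = rng.normal(size=(116, 40))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=116)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Linear Regression": make_pipeline(
        MinMaxScaler(), PCA(n_components=10), LinearRegression()
    ),
    "Random Forest": make_pipeline(
        MinMaxScaler(), RandomForestRegressor(n_estimators=100, random_state=0)
    ),
}
results = {}
for name, model in models.items():
    # The scaler and PCA are refitted on the training folds of every split
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    results[name] = scores
    print(f"{name} R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds makes it easier to tell whether a difference between two models is larger than the split-to-split noise.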


## Problem 3

It's 3000 BC, and you are the leader of a tribe of 4000 people. You are leading your tribe to a new location where you must build a circular settlement from scratch. How big will it be and how long will it take to build a stone wall around it?

In this scenario, I am leading a tribe of 4000 people to a new location where I must build a circular settlement from scratch.

To determine the size and construction time of the settlement, I would consider the following factors:

**Population**

The size of the settlement will be determined by the number of people in the tribe, which is the one known quantity (4,000).

**Land area**

The available land area will influence the size of the settlement. This quantity is unknown.

**Building materials**

The availability and accessibility of building materials will affect the construction time and the size of the settlement.

**Labor**

The number of workers and their skills will impact the construction time.

**Technology**

The level of technology available for quarrying, transporting, and placing the stones will also affect the construction time.


### Conclusion

Without specific information about the available land area, building materials, labor force, and technology, it is difficult to give an exact size for the settlement or an exact construction time for the stone wall.

However, the factors mentioned above can be used to make informed decisions and plan the construction accordingly.
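
Once values are assumed for these factors, the estimate reduces to simple geometry. The sketch below is a back-of-the-envelope calculation in which every number except the population is an assumption chosen for illustration, not historical data.

```python
import math

population = 4000                 # given in the problem
area_per_person_m2 = 50.0         # assumed: housing, paths, storage, animal pens
wall_height_m = 3.0               # assumed
wall_thickness_m = 1.0            # assumed
builders = 1000                   # assumed: a quarter of the tribe works on the wall
stone_m3_per_builder_day = 0.1    # assumed: quarry, haul, and lay by hand

# Circular settlement: area = pi * r^2, wall length = 2 * pi * r
area_m2 = population * area_per_person_m2
radius_m = math.sqrt(area_m2 / math.pi)
circumference_m = 2 * math.pi * radius_m

wall_volume_m3 = circumference_m * wall_height_m * wall_thickness_m
days = wall_volume_m3 / (builders * stone_m3_per_builder_day)

print(f"radius ~ {radius_m:.0f} m, wall length ~ {circumference_m:.0f} m, "
      f"volume ~ {wall_volume_m3:.0f} m^3, build time ~ {days:.0f} days")
```

Under these assumptions the settlement has a radius of about 250 m, the wall is about 1.6 km long, and the build takes roughly 48 days. Changing any assumed value rescales the result, which is why the conclusion above stays qualitative.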

## Problem 4

Is there an inconsistency in the following paragraph?:

"A suburban located Starbucks makes on average $100,000 per month in revenue and has 10,500 square meters of an adjacent area dedicated to parking for visitors only. Despite good revenue and overall satisfaction with service, both the staff and visitors are complaining that parking is full more than half of the time."

### Answer

This statement is inconsistent, assuming that there are no typographical errors in the units of measurement.

It states that the Starbucks has 10,500 square meters of adjacent area dedicated to visitor parking. This seems unusually large for a suburban location, and a lot of that size being full more than half of the time is hard to reconcile with the stated revenue.

A more typical size for a parking area at a suburban Starbucks might be around 10,500 square feet (about 975 square meters), rather than 10,500 square meters.
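
The mismatch can be made concrete with rough arithmetic. In the sketch below, the area per parking space and the average spend per transaction are assumptions for illustration; only the parking area and the revenue come from the paragraph.

```python
parking_area_m2 = 10_500        # from the paragraph
monthly_revenue_usd = 100_000   # from the paragraph
area_per_space_m2 = 30.0        # assumed: one stall plus its share of the aisles
avg_ticket_usd = 7.0            # assumed average spend per transaction
days_per_month = 30
sq_ft_to_m2 = 0.092903          # unit conversion factor

spaces = parking_area_m2 / area_per_space_m2
transactions_per_day = monthly_revenue_usd / days_per_month / avg_ticket_usd
spaces_if_sq_ft = parking_area_m2 * sq_ft_to_m2 / area_per_space_m2

print(f"~{spaces:.0f} spaces vs ~{transactions_per_day:.0f} transactions per day")
print(f"~{spaces_if_sq_ft:.0f} spaces if the figure were in square feet")
```

With these assumptions the lot holds about 350 cars while the store serves under 500 transactions a day, so short visits could not keep it full more than half of the time. Read as square feet, the same figure gives roughly 33 spaces, which a busy store could plausibly fill.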