# Data Analysis of Layoffs in the United States Tech Industry 

## Project Motivation and Background

As Computer Science students about to enter the job market, we're concerned about the volatility of the tech industry. We want to analyze layoff data and build a system that helps people understand the market, plan an exit strategy, and ease these concerns.

## Project Goal
The goal of our project is to analyze companies' recent layoffs across a variety of industries (aerospace, travel, retail, etc.) and detect patterns in them. We will do this by looking at the number of employees laid off, the companies' locations, their funding stages, and the funds they have raised.
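
As a rough illustration of the kind of pattern-finding described above, the sketch below aggregates layoffs and funding per industry. The rows are hypothetical toy values (not taken from `layoffs.csv`), and the column names are assumed to match the dataset's.

```python
import pandas as pd

# Hypothetical toy rows that mimic the columns named above; not real data.
toy = pd.DataFrame({
    "industry": ["Retail", "Retail", "Travel", "Aerospace"],
    "stage": ["Series B", "Post-IPO", "Series B", "Unknown"],
    "total_laid_off": [100, 250, 40, 15],
    "funds_raised": [50.0, 900.0, 120.0, 30.0],
})

# Total layoffs and median funding per industry
by_industry = toy.groupby("industry").agg(
    total_laid_off=("total_laid_off", "sum"),
    median_funds=("funds_raised", "median"),
)
print(by_industry)
```

The same grouping could be applied to `stage` or `location` to compare layoffs along those dimensions.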




In [14]:
import pandas as pd
import numpy as np

# Read in the data
layoffs = pd.read_csv('layoffs.csv')
layoffs.head()

# One-hot encode the categorical variables
print(f"Unique values for 'company': {len(layoffs['company'].unique())}")

# Count how many dummy columns encoding location, industry, stage, and country would add
totalNewCols = sum(len(layoffs[col].unique()) for col in ['location', 'industry', 'stage', 'country'])
print(f"Total number of new columns: {totalNewCols}")

#loca = pd.get_dummies(layoffs['location'], prefix='location')
indu = pd.get_dummies(layoffs['industry'], prefix='industry')
stag = pd.get_dummies(layoffs['stage'], prefix='stage')
#coun = pd.get_dummies(layoffs['country'], prefix='country')

# drop the original columns
#layoffs.drop(['location', 'industry', 'stage', 'country'], axis=1, inplace=True)
layoffs.drop(['industry', 'stage'], axis=1, inplace=True)


# concat the new columns
#layoffs = pd.concat([layoffs, loca, indu, stag, coun], axis=1)
layoffs = pd.concat([layoffs, indu, stag], axis=1)


layoffs.head()

Unique values for 'company': 2021
Total number of new columns: 315


| Unnamed: 0 | company | location | total_laid_off | percentage_laid_off | date | country | funds_raised | industry_Aerospace | industry_Construction | industry_Consumer | ... | stage_Series C | stage_Series D | stage_Series E | stage_Series F | stage_Series G | stage_Series H | stage_Series I | stage_Series J | stage_Subsidiary | stage_Unknown |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | N26 | Berlin | 71.0 | 0.04 | 2023-04-28 | United States | 1700.0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Providoor | Melbourne | NaN | 1.0 | 2023-04-28 | Australia | NaN | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | Dropbox | SF Bay Area | 500.0 | 0.16 | 2023-04-27 | United States | 1700.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Vroom | New York City | 120.0 | 0.11 | 2023-04-27 | United States | 1300.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Greenhouse | New York City | 100.0 | 0.12 | 2023-04-27 | United States | 110.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
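
The encoding cell above builds the dummy columns, drops the originals, and concatenates the result in three separate steps. A more compact alternative, sketched here on a hypothetical toy frame rather than the real dataset, is to pass `columns=` to `pd.get_dummies`, which encodes the listed columns and removes the originals in one call. `dtype=int` is used so the dummies show as 0/1 as in the preview above.

```python
import pandas as pd

# Hypothetical toy frame with the same kind of categorical columns
toy = pd.DataFrame({
    "company": ["A", "B", "C"],
    "industry": ["Retail", "Travel", "Retail"],
    "stage": ["Series B", "Unknown", "Series C"],
})

# Encode both columns in one call; the original columns are dropped automatically
encoded = pd.get_dummies(toy, columns=["industry", "stage"], dtype=int)
print(encoded.columns.tolist())
```

Column prefixes default to the source column name, so the resulting names follow the same `industry_...` and `stage_...` pattern as the cell above.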


In [17]:
# Drop identifier and free-text columns that are not used as features
training_data = layoffs.drop(['company', 'date', 'location', 'country'], axis=1)

# Split the data into training (70%) and testing (30%) sets
train_set = training_data.sample(frac=0.7, random_state=0)
test_set = training_data.drop(train_set.index)

print(f"Training set shape: {train_set.shape}")
print(f"Testing set shape: {test_set.shape}")

Training set shape: (1782, 48)
Testing set shape: (763, 48)
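
The 70/30 split above is done with `DataFrame.sample` and `drop`. For comparison, here is a minimal sketch of the same idea using scikit-learn's `train_test_split`, run on a hypothetical ten-row toy frame. It will not necessarily select the same rows as the `sample`-based split, even with the same seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the feature table
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Shuffle and split 70/30 with a fixed seed for reproducibility
train_toy, test_toy = train_test_split(toy, test_size=0.3, random_state=0)
print(train_toy.shape, test_toy.shape)
```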


In [18]:
from numpy import arange
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RepeatedKFold



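# Define the cross-validation scheme used to evaluate the model
# (10 splits x 3 repeats = 30 folds)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Define the model: 100 candidate alphas, scored by mean absolute error
model = RidgeCV(alphas=arange(0.000001, 1, 0.01), cv=cv, scoring='neg_mean_absolute_error')

# Features: everything except the two layoff columns; missing feature values become 0
X = train_set.drop(['percentage_laid_off', 'total_laid_off'], axis=1)
X = X.fillna(0)
# Target: fraction of the workforce laid off
y = train_set["percentage_laid_off"]

# Save the feature matrix for inspection
X.to_csv('training_X.csv')

y.head()

# Fit the model
model.fit(X, y)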





ValueError: 
All the 3000 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3000 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\rezak\anaconda3\envs\cs484\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\rezak\anaconda3\envs\cs484\lib\site-packages\sklearn\linear_model\_ridge.py", line 1126, in fit
    X, y = self._validate_data(
  File "c:\Users\rezak\anaconda3\envs\cs484\lib\site-packages\sklearn\base.py", line 584, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "c:\Users\rezak\anaconda3\envs\cs484\lib\site-packages\sklearn\utils\validation.py", line 1106, in check_X_y
    X = check_array(
  File "c:\Users\rezak\anaconda3\envs\cs484\lib\site-packages\sklearn\utils\validation.py", line 921, in check_array
    _assert_all_finite(
  File "c:\Users\rezak\anaconda3\envs\cs484\lib\site-packages\sklearn\utils\validation.py", line 161, in _assert_all_finite
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
Ridge does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
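
The error message above suggests two remedies: imputing missing values in a pipeline, or dropping samples that contain them. The sketch below combines both on hypothetical synthetic data (not `layoffs.csv`): rows with a missing target are dropped, since they cannot be used for supervised fitting, and missing feature values are filled with the column median via `SimpleImputer`. The column names, the median strategy, and the smaller alpha grid are illustrative assumptions, not the project's final choices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold
from sklearn.pipeline import make_pipeline

# Hypothetical synthetic data with the same kinds of columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "funds_raised": rng.uniform(1, 2000, size=60),
    "industry_Retail": rng.integers(0, 2, size=60),
    "percentage_laid_off": rng.uniform(0, 1, size=60),
})
# Inject missing values into one feature and into the target
toy.loc[::7, "funds_raised"] = np.nan
toy.loc[::11, "percentage_laid_off"] = np.nan

# Drop rows whose target is missing
toy = toy.dropna(subset=["percentage_laid_off"])
X_toy = toy.drop(columns=["percentage_laid_off"])
y_toy = toy["percentage_laid_off"]

# Impute remaining feature NaNs, then fit ridge regression with cross-validated alpha
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    RidgeCV(alphas=np.arange(0.01, 1, 0.1), cv=cv, scoring="neg_mean_absolute_error"),
)
pipe.fit(X_toy, y_toy)
print("Selected alpha:", pipe.named_steps["ridgecv"].alpha_)
```

In this arrangement the imputer is fitted once on all training rows before `RidgeCV` runs its internal folds. That is simpler but lets the median see the validation folds, so a stricter setup would tune alpha with the whole pipeline inside the cross-validation loop.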
