# Final Data Identification

By next Monday, identify and load the dataset you will work on for your final project. The project involves exploring this dataset and must include either a classification or a regression problem as its modeling component.

Your dataset should meet the following criteria:

- More than 1000 rows
- A mixture of variable types


In addition to acquiring the dataset, you should begin your EDA process. State clearly the questions that your code addresses. Your final product will be a Jupyter notebook and a 5-minute in-class presentation. Before that, plan to introduce your project next Monday in 1 minute or less, with some visual support.


----------------------------------------------------------------------------------------------------------------------

# __<u>Final Project</u>__

## <font color = navy>Mental Health in the Tech Industry</font>

![](images/mental-health.jpg) 

   ![](images/sad_programmer.jpg) 

![](images/life_motto.jpg)

### Columns 

This dataset (2014) contains the following data:

- Timestamp
- Age
- Gender
- Country
- state: If you live in the United States, which state or territory do you live in?
- self_employed: Are you self-employed?
- family_history: Do you have a family history of mental illness?
- treatment: Have you sought treatment for a mental health condition?
- work_interfere: If you have a mental health condition, do you feel that it interferes with your work?
- no_employees: How many employees does your company or organization have?
- remote_work: Do you work remotely (outside of an office) at least 50% of the time?
- tech_company: Is your employer primarily a tech company/organization?
- benefits: Does your employer provide mental health benefits?
- care_options: Do you know the options for mental health care your employer provides?
- wellness_program: Has your employer ever discussed mental health as part of an employee wellness program?
- seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help?
- anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?
- leave: How easy is it for you to take medical leave for a mental health condition?
- mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences?
- phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences?
- coworkers: Would you be willing to discuss a mental health issue with your coworkers?
- supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)?
- mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview?
- phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview?
- mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health?
- obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?
- comments: Any additional notes or comments

### <font color = red>This dataset is from a 2014 survey that measures attitudes towards mental health and frequency of mental health disorders in the tech workplace.</font>

### <font color = red>Source: Open Sourcing Mental Illness (OSMI)</font>

https://www.kaggle.com/osmi/mental-health-in-tech-survey

In [27]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression, SGDClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report, precision_recall_curve
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

In [28]:
mhealth = pd.read_csv('data/mental_health_survey.csv')

In [29]:
mhealth.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
Timestamp                    1259 non-null object
Age                          1259 non-null int64
Gender                       1259 non-null object
Country                      1259 non-null object
state                        744 non-null object
self_employed                1241 non-null object
family_history               1259 non-null object
treatment                    1259 non-null object
work_interfere               995 non-null object
no_employees                 1259 non-null object
remote_work                  1259 non-null object
tech_company                 1259 non-null object
benefits                     1259 non-null object
care_options                 1259 non-null object
wellness_program             1259 non-null object
seek_help                    1259 non-null object
anonymity                    1259 non-null object
leave                        1259 non-null obj

In [30]:
mhealth.head()

| | Timestamp | Age | Gender | Country | state | self_employed | family_history | treatment | work_interfere | no_employees | ... | leave | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence | comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-27 11:29:31 | 37 | Female | United States | IL | | No | Yes | Often | 6-25 | ... | Somewhat easy | No | No | Some of them | Yes | No | Maybe | Yes | No | |
| 1 | 2014-08-27 11:29:37 | 44 | M | United States | IN | | No | No | Rarely | More than 1000 | ... | Don't know | Maybe | No | No | No | No | No | Don't know | No | |
| 2 | 2014-08-27 11:29:44 | 32 | Male | Canada | | | No | No | Rarely | 6-25 | ... | Somewhat difficult | No | No | Yes | Yes | Yes | Yes | No | No | |
| 3 | 2014-08-27 11:29:46 | 31 | Male | United Kingdom | | | Yes | Yes | Often | 26-100 | ... | Somewhat difficult | Yes | Yes | Some of them | No | Maybe | Maybe | No | Yes | |
| 4 | 2014-08-27 11:30:22 | 31 | Male | United States | TX | | No | No | Never | 100-500 | ... | Don't know | No | No | Some of them | Yes | Yes | Yes | Don't know | No | |


In [31]:
mhealth.tech_company.value_counts()

Yes    1031
No      228
Name: tech_company, dtype: int64

In [32]:
mhealth.Gender.value_counts()

Male                                              615
male                                              206
Female                                            121
M                                                 116
female                                             62
F                                                  38
m                                                  34
f                                                  15
Make                                                4
Woman                                               3
Male                                                3
Female (trans)                                      2
Man                                                 2
Female                                              2
Cis Male                                            2
queer                                               1
woman                                               1
Androgyne                                           1
maile                       

__The Gender column was evidently a free-text field in the survey: respondents could enter anything, so the same answer appears under many different spellings. The column also contains typos such as "Make", "Malr", and "Msle".__
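
One way to tame the free-text Gender column is to normalize case and whitespace and then map known spellings onto a few groups. The sketch below is only an illustration: the helper `clean_gender`, the two lookup sets, and the catch-all "Other" bucket are hypothetical, and the sample values are a handful copied from the output above rather than the full column. How answers are grouped is a judgment call that should be reviewed against the complete list of responses.

```python
import pandas as pd

# Small sample of answers copied from the value_counts output (not the full column).
raw = pd.Series(["Male", "male", "M", "m", "Make", "Man", "Cis Male",
                 "Female", "female", "F", "f", "Woman", "woman",
                 "Female (trans)", "queer", "Androgyne"])

# Assumed lookup sets; extend them after inspecting every distinct answer.
MALE = {"male", "m", "make", "man", "cis male"}
FEMALE = {"female", "f", "woman"}

def clean_gender(value):
    """Collapse a free-text answer into 'Male', 'Female', or 'Other'."""
    key = str(value).strip().lower()  # ignore case and stray whitespace
    if key in MALE:
        return "Male"
    if key in FEMALE:
        return "Female"
    return "Other"  # everything unrecognized lands here for manual review

gender_clean = raw.map(clean_gender)
print(gender_clean.value_counts())
```

Stripping whitespace also merges entries that look identical in the output but are counted separately, such as the two "Male" rows.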

In [33]:
mhealth.Country.value_counts()

United States             751
United Kingdom            185
Canada                     72
Germany                    45
Netherlands                27
Ireland                    27
Australia                  21
France                     13
India                      10
New Zealand                 8
Poland                      7
Sweden                      7
Switzerland                 7
Italy                       7
Brazil                      6
Belgium                     6
South Africa                6
Israel                      5
Singapore                   4
Bulgaria                    4
Mexico                      3
Russia                      3
Finland                     3
Austria                     3
Portugal                    2
Croatia                     2
Greece                      2
Denmark                     2
Colombia                    2
Nigeria                     1
Czech Republic              1
Moldova                     1
Hungary                     1
Norway    

In [34]:
mhealth.obs_consequence.value_counts()

No     1075
Yes     184
Name: obs_consequence, dtype: int64

__The obs_consequence column, which asks whether the respondent has heard of or observed negative consequences for coworkers with mental health conditions, contains only Yes/No values. This makes it easy to convert to binary for a classification problem.__
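
The counts above also show that the two classes are unbalanced, which matters when judging a classifier later. As a rough sketch, the labels can be rebuilt from the printed counts and passed to scikit-learn's `DummyClassifier` to see what accuracy "always predict No" would reach. The arrays here are reconstructed for illustration and are not the survey rows themselves.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the printed counts: 1075 "No" (0) and 184 "Yes" (1).
y = np.array([0] * 1075 + [1] * 184)
X = np.zeros((len(y), 1))  # placeholder features; DummyClassifier ignores them

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(round(baseline.score(X, y), 3))  # → 0.854
```

Any model trained on this target should be compared against that figure rather than against 50%.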

In [35]:
obs_consequence_dum = pd.get_dummies(mhealth.obs_consequence, drop_first = True)

In [36]:
obs_consequence_dum.head()

| | Yes |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 1 |
| 4 | 0 |


In [37]:
mhealth['obs_consequence_dum'] = obs_consequence_dum

In [38]:
mhealth.head()

| | Timestamp | Age | Gender | Country | state | self_employed | family_history | treatment | work_interfere | no_employees | ... | mental_health_consequence | phys_health_consequence | coworkers | supervisor | mental_health_interview | phys_health_interview | mental_vs_physical | obs_consequence | comments | obs_consequence_dum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-27 11:29:31 | 37 | Female | United States | IL | | No | Yes | Often | 6-25 | ... | No | No | Some of them | Yes | No | Maybe | Yes | No | | 0 |
| 1 | 2014-08-27 11:29:37 | 44 | M | United States | IN | | No | No | Rarely | More than 1000 | ... | Maybe | No | No | No | No | No | Don't know | No | | 0 |
| 2 | 2014-08-27 11:29:44 | 32 | Male | Canada | | | No | No | Rarely | 6-25 | ... | No | No | Yes | Yes | Yes | Yes | No | No | | 0 |
| 3 | 2014-08-27 11:29:46 | 31 | Male | United Kingdom | | | Yes | Yes | Often | 26-100 | ... | Yes | Yes | Some of them | No | Maybe | Maybe | No | Yes | | 1 |
| 4 | 2014-08-27 11:30:22 | 31 | Male | United States | TX | | No | No | Never | 100-500 | ... | No | No | Some of them | Yes | Yes | Yes | Don't know | No | | 0 |


In [42]:
mhealth.obs_consequence_dum.dtype

dtype('uint8')

__Using 'pd.get_dummies', I converted the 'Yes' and 'No' values in the obs_consequence column to binary values to prepare for a classification problem using LogisticRegression. Calling '.value_counts' on the new column confirms that the conversion was successful: the counts of 0 and 1 match the original counts of 'No' and 'Yes'.__

In [44]:
mhealth.obs_consequence_dum.value_counts()

0    1075
1     184
Name: obs_consequence_dum, dtype: int64
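
The encoding step can be checked on a toy Series. This is a minimal sketch using the first five answers shown in the `head()` output; the variable names are illustrative. One assumption worth flagging: the `uint8` dtype shown above comes from an older pandas release, whereas pandas 2.0 and later return booleans unless `dtype` is passed explicitly.

```python
import pandas as pd

# First five obs_consequence answers, as shown in the head() output.
answers = pd.Series(["No", "No", "No", "Yes", "No"], name="obs_consequence")

# drop_first=True drops the first level ("No"), leaving a single "Yes" column.
# dtype=int keeps 0/1 integers regardless of the installed pandas version.
dummies = pd.get_dummies(answers, drop_first=True, dtype=int)
encoded = dummies["Yes"]
print(encoded.tolist())  # → [0, 0, 0, 1, 0]

# An explicit mapping gives the same result and makes the coding visible.
mapped = answers.map({"No": 0, "Yes": 1})
print((encoded == mapped).all())  # → True
```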

In [45]:
mhealth.Age.value_counts()

 29             85
 32             82
 26             75
 27             71
 33             70
 28             68
 31             67
 34             65
 30             63
 25             61
 35             55
 23             51
 24             46
 37             43
 38             39
 36             37
 39             33
 40             33
 43             28
 41             21
 22             21
 42             20
 21             16
 45             12
 46             12
 44             11
 19              9
 18              7
 20              6
 48              6
 50              6
 51              5
 56              4
 49              4
 57              3
 54              3
 55              3
 47              2
 60              2
 11              1
 8               1
 5               1
 99999999999     1
-1726            1
 53              1
 58              1
 61              1
 62              1
 65              1
 72              1
 329             1
-29              1
-1          
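
The Age counts above include values that cannot be real ages (negative numbers, 329, 99999999999) and a few implausibly young respondents. A possible cleanup, sketched here on a handful of values copied from that output, is to keep only ages inside a plausible range. The 18 to 75 bounds are an assumption chosen for illustration, not something specified by the survey.

```python
import pandas as pd

# A few Age values copied from the output above, including the invalid ones.
ages = pd.Series([29, 32, 11, 8, 5, 99999999999, -1726, 329, -29, -1, 72], name="Age")

# Assumed plausible range for working respondents; adjust as needed.
MIN_AGE, MAX_AGE = 18, 75
valid = ages.between(MIN_AGE, MAX_AGE)  # inclusive on both ends

print(ages[valid].tolist())  # → [29, 32, 72]
print(f"dropped {(~valid).sum()} of {len(ages)} rows")  # → dropped 8 of 11 rows
```

On the full DataFrame the same boolean mask could be used to filter rows before Age is used as a model feature.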

In [46]:
tech_company_dum = pd.get_dummies(mhealth.tech_company, drop_first = True)

In [48]:
mhealth['tech_company_dum'] = tech_company_dum

In [49]:
mhealth.tech_company_dum.value_counts()

1    1031
0     228
Name: tech_company_dum, dtype: int64

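As with all LogisticRegression problems, the first step is to choose the features (X) and the target (y) and split them into training and test sets.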

In [19]:
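# Use the encoded columns: the features must be numeric, and the target
# is the binary column built from obs_consequence above.
X = mhealth[['Age', 'tech_company_dum']]
y = mhealth.obs_consequence_dum
X_train, X_test, y_train, y_test = train_test_split(X, y)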

In [20]:
lgr = LogisticRegression()

In [21]:
lgr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [22]:
X_train.shape

(944, 2)

In [23]:
y_train.shape

(944,)

In [26]:
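# classification_report needs hard class labels, so use predict().
# Passing the output of predict_proba() instead raised:
# ValueError: Mix type of y not allowed, got types {'binary', 'continuous-multioutput'}
pred = lgr.predict(X_test)
print(classification_report(y_test, pred))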
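
`classification_report` compares true labels against predicted labels, so it expects the output of `predict`. `predict_proba` instead returns a two-column array of class probabilities, which scikit-learn classifies as 'continuous-multioutput'. The sketch below illustrates the difference on synthetic stand-in data (not the survey); the variable names and the random data are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two numeric features and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

proba = model.predict_proba(X_test)  # shape (n_samples, 2): one probability per class
labels = model.predict(X_test)       # shape (n_samples,): hard 0/1 labels

print(proba.shape, labels.shape)
print(classification_report(y_test, labels))
```

The probabilities remain useful for other tools, for example `precision_recall_curve`, which takes the positive-class column `proba[:, 1]` rather than hard labels.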