# Exercise 9: Cognitive Data Analysis (⭐️⭐️⭐️)

We have collected behavioural and cognitive data from 100 people as part of a study called GBIT (Great Britain Intelligence Test). The data consist of their performance in three cognitive tasks, as well as their demographics and measures of their mental health. The data are provided in two separate CSV files: cognitive.csv contains the results of the cognitive tasks, and demographics.csv contains the answers from the questionnaires. The data were anonymised and basic processing has already been completed. The aim of this exercise is to clean the data properly and run some statistical tests to analyse them.

IMPORTANT: To be able to complete this exercise, you must have completed the statistical theory and coding tutorial.

## Data cleaning

1. Import the data as pandas dataframes.

In [101]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import statsmodels.api as sm

In [102]:
df_cog = pd.read_csv('cognitive_primer.csv')
df_dem = pd.read_csv('demographics_primer.csv')

2. Check the column headers, the shape of each dataframe and the column types. There is a column that should be an integer but is, instead, a string. Detect that column and change the variable type.

In [103]:
df_cog.head()

Unnamed: 0,user_id,SummaryScore_Blocks,RT_Blocks,timepoint_by_site,SummaryScore_VerbalAnalogies,RT_VerbalAnalogies,SummaryScore_WordDefinitions
0,ca8eba947d4c4c419321f26b85d89fc9,11.0,4577.0,tp1,15,6439.36,17.0
1,13ba938814984873b73a70ef48e481c0,8.0,5554.0,tp1,8,5438.1786,17.0
2,a7252125-ecf8-4985-adb9-bf93e89142ec,14.0,4410.0,tp1,22,5166.8,11.0
3,1d5ed00c406a443f974cbf511695c9d5,10.0,3522.0,tp1,34,3429.119,13.0
4,a968d94b5d6344c6968c6ee840c5b944,11.0,4491.0,tp1,27,4087.027,20.0


In [104]:
df_dem.tail()

Unnamed: 0,user_id,Sex,Education,Age,Handedness,Residence,Ethnicity,Occupation,Salary,Residence (County)
8305,3c8c33784419410a90425ba720378126,Male,02_Degree,82.0,Right handed,United Kingdom,White,Retired,,
8306,3c8c33784419410a90425ba720378126,Male,01_School,82.0,Right handed,United Kingdom,Asian or Asian British,Retired,,
8307,6218275f-15b2-4c7b-ba55-2838c9153010,Female,02_Degree,20.0,Right handed,United Kingdom,White,Student,,Gloucestershire
8308,6218275f-15b2-4c7b-ba55-2838c9153010,Female,02_Degree,20.0,Right handed,United Kingdom,White,Student,,
8309,7152ef3105984cac9a8b784f5a314841,Female,01_School,60.0,Right handed,United Kingdom,White,Homemaker,,Lanarkshire


In [105]:
np.shape(df_cog)

(8000, 7)

In [106]:
np.shape(df_dem)

(8310, 10)

In [107]:
df_cog.dtypes

user_id                          object
SummaryScore_Blocks             float64
RT_Blocks                       float64
timepoint_by_site                object
SummaryScore_VerbalAnalogies      int64
RT_VerbalAnalogies              float64
SummaryScore_WordDefinitions    float64
dtype: object

In [108]:
df_dem.dtypes

user_id                object
Sex                    object
Education              object
Age                   float64
Handedness             object
Residence              object
Ethnicity              object
Occupation             object
Salary                 object
Residence (County)     object
dtype: object

In [109]:
# astype returns a new Series; assign the result back to df_dem['Age'] to keep the change
df_dem.Age.astype(int)

0       19
1       66
2       55
3       74
4       61
        ..
8305    82
8306    82
8307    20
8308    20
8309    60
Name: Age, Length: 8310, dtype: int64
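The type conversion in this step only takes effect if the result is stored. Here is a minimal sketch on a made-up toy frame (the `toy` dataframe and its `Score` column are illustrative assumptions, not part of the GBIT data), showing both a float-to-integer cast and a string-to-number conversion:

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({"Age": [19.0, 66.0, 55.0], "Score": ["11", "8", "14"]})

# astype returns a new Series, so the result is assigned back to the column.
toy["Age"] = toy["Age"].astype("int64")

# A numeric column that was read in as strings can be converted with pd.to_numeric.
toy["Score"] = pd.to_numeric(toy["Score"])

age_dtype = str(toy["Age"].dtype)
score_dtype = str(toy["Score"].dtype)
```

Note that a cast to an integer type raises an error if the column still contains missing values, so it is usually done after (or together with) the NA handling.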

3. Currently, the index of each dataframe is not easy to interpret. Replace it with the values in the user_id column.

In [110]:
df_cog.index = df_cog.user_id
df_dem.index = df_dem.user_id

In [111]:
df_cog.head()

user_id (index),user_id,SummaryScore_Blocks,RT_Blocks,timepoint_by_site,SummaryScore_VerbalAnalogies,RT_VerbalAnalogies,SummaryScore_WordDefinitions
ca8eba947d4c4c419321f26b85d89fc9,ca8eba947d4c4c419321f26b85d89fc9,11.0,4577.0,tp1,15,6439.36,17.0
13ba938814984873b73a70ef48e481c0,13ba938814984873b73a70ef48e481c0,8.0,5554.0,tp1,8,5438.1786,17.0
a7252125-ecf8-4985-adb9-bf93e89142ec,a7252125-ecf8-4985-adb9-bf93e89142ec,14.0,4410.0,tp1,22,5166.8,11.0
1d5ed00c406a443f974cbf511695c9d5,1d5ed00c406a443f974cbf511695c9d5,10.0,3522.0,tp1,34,3429.119,13.0
a968d94b5d6344c6968c6ee840c5b944,a968d94b5d6344c6968c6ee840c5b944,11.0,4491.0,tp1,27,4087.027,20.0


In [112]:
df_dem.head()

user_id (index),user_id,Sex,Education,Age,Handedness,Residence,Ethnicity,Occupation,Salary,Residence (County)
8162511fe0dc4d04afe57193a8804afa,8162511fe0dc4d04afe57193a8804afa,Male,01_School,19.0,Right handed,United Kingdom,White,Worker,,
95bc4b40-60e9-4342-8de2-55f6d7deef7d,95bc4b40-60e9-4342-8de2-55f6d7deef7d,Male,01_School,66.0,Right handed,United Kingdom,White,Retired,,West Sussex
ac8098ba57d54e16a12bc8d72433ad0f,ac8098ba57d54e16a12bc8d72433ad0f,Male,01_School,55.0,Left handed,United Kingdom,White,Worker,£30-40K,Hampshire
95596ceabf6e45af80b6f5c9ab7ae2c8,95596ceabf6e45af80b6f5c9ab7ae2c8,Female,00_preGCSE,74.0,Right handed,United Kingdom,White,Unemployed/Looking for work,,
d76fe6d2ca0a4e11a8d9506b0b9fe2a7,d76fe6d2ca0a4e11a8d9506b0b9fe2a7,Male,01_School,61.0,Right handed,United Kingdom,White,Worker,£10-20K,Aberdeenshire
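An alternative to assigning the column to `.index` is `DataFrame.set_index`. The sketch below uses an invented toy frame (the ids and ages are assumptions for illustration) to show the difference between keeping and dropping the original column:

```python
import pandas as pd

# Toy frame with made-up ids, standing in for df_cog or df_dem.
toy = pd.DataFrame({"user_id": ["a1", "b2", "c3"], "Age": [19, 66, 55]})

# drop=False keeps user_id as a regular column in addition to the index,
# which matches the result of `toy.index = toy.user_id`.
indexed = toy.set_index("user_id", drop=False)

# The default (drop=True) moves the column into the index.
indexed_only = toy.set_index("user_id")
```

With `drop=False` the name `user_id` refers to both an index level and a column, which pandas treats as ambiguous in operations such as merging. This is why the duplicate column has to be removed before the merge in step 7.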


4. Find out which columns have missing values and how many there are in each column. If more than 50% of a column's values are missing (NA), drop that column. Most of the analysis cannot be completed for participants without demographics. So, after dropping those columns, keep only the rows of the questionnaire dataframe that have no missing values.

In [113]:
df_cog.isna().sum(axis=1)

user_id
ca8eba947d4c4c419321f26b85d89fc9        0
13ba938814984873b73a70ef48e481c0        0
a7252125-ecf8-4985-adb9-bf93e89142ec    0
1d5ed00c406a443f974cbf511695c9d5        0
a968d94b5d6344c6968c6ee840c5b944        0
                                       ..
827bbfac-bb57-42ba-8dad-2e74ae3d31a4    0
665efd6621d74307a6f0072a9ad115b5        0
7103cf2e-ee29-4959-be61-34d0041446e8    0
0a48f36e7ca546788526732b609d519c        0
9555f9b825b84bc29317965de7eacf93        0
Length: 8000, dtype: int64

In [114]:
df_cog.isna().sum(axis=0)

user_id                         0
SummaryScore_Blocks             0
RT_Blocks                       0
timepoint_by_site               0
SummaryScore_VerbalAnalogies    0
RT_VerbalAnalogies              1
SummaryScore_WordDefinitions    0
dtype: int64

In [115]:
df_dem.isna().sum(axis=0)

user_id                  0
Sex                      1
Education              226
Age                      0
Handedness               0
Residence                1
Ethnicity                0
Occupation             238
Salary                5145
Residence (County)    5114
dtype: int64

In [116]:
df_dem.drop(['Salary', 'Residence (County)'], axis=1, inplace=True)

In [117]:
to_remove = df_dem.isna().sum(axis=1) > 0

In [118]:
sum(to_remove)

239

In [119]:
df_dem = df_dem[~to_remove]
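The 50% rule can also be applied programmatically instead of reading the counts off by eye. A minimal sketch on an invented toy frame (column names mirror the questionnaire data, but the values are made up):

```python
import pandas as pd

# Toy questionnaire data with made-up values.
toy = pd.DataFrame({
    "Sex": ["Male", "Female", None, "Female"],
    "Age": [19.0, 66.0, 55.0, 74.0],
    "Salary": [None, None, "30-40K", None],
})

# The mean of a boolean mask is the fraction of missing values per column.
na_fraction = toy.isna().mean()

# Columns with more than 50% missing values are dropped...
cols_to_drop = na_fraction[na_fraction > 0.5].index
# ...and then any row that still has a missing value is removed.
cleaned = toy.drop(columns=cols_to_drop).dropna()
```

Here `dropna()` is equivalent to the boolean mask `to_remove` used in the cells above: it keeps only rows with no missing values.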

5. Find out if there are duplicates in either dataframe. If two rows are fully duplicated, keep the second entry. In the questionnaire dataframe, if rows are duplicated in the user_id, Sex and Residence columns but not in the others, drop both rows.

In [120]:
sum(df_cog.duplicated())

0

In [121]:
sum(df_dem.duplicated())

22

In [122]:
df_dem.drop_duplicates(subset = None, keep = "last", inplace=True)
df_dem

user_id (index),user_id,Sex,Education,Age,Handedness,Residence,Ethnicity,Occupation
8162511fe0dc4d04afe57193a8804afa,8162511fe0dc4d04afe57193a8804afa,Male,01_School,19.0,Right handed,United Kingdom,White,Worker
95bc4b40-60e9-4342-8de2-55f6d7deef7d,95bc4b40-60e9-4342-8de2-55f6d7deef7d,Male,01_School,66.0,Right handed,United Kingdom,White,Retired
ac8098ba57d54e16a12bc8d72433ad0f,ac8098ba57d54e16a12bc8d72433ad0f,Male,01_School,55.0,Left handed,United Kingdom,White,Worker
95596ceabf6e45af80b6f5c9ab7ae2c8,95596ceabf6e45af80b6f5c9ab7ae2c8,Female,00_preGCSE,74.0,Right handed,United Kingdom,White,Unemployed/Looking for work
d76fe6d2ca0a4e11a8d9506b0b9fe2a7,d76fe6d2ca0a4e11a8d9506b0b9fe2a7,Male,01_School,61.0,Right handed,United Kingdom,White,Worker
...,...,...,...,...,...,...,...,...
4fccc5e37ece41d8999de0af4b026a30,4fccc5e37ece41d8999de0af4b026a30,Female,01_School,59.0,Right handed,United Kingdom,Asian or Asian British,Retired
3c8c33784419410a90425ba720378126,3c8c33784419410a90425ba720378126,Male,02_Degree,82.0,Right handed,United Kingdom,White,Retired
3c8c33784419410a90425ba720378126,3c8c33784419410a90425ba720378126,Male,01_School,82.0,Right handed,United Kingdom,Asian or Asian British,Retired
6218275f-15b2-4c7b-ba55-2838c9153010,6218275f-15b2-4c7b-ba55-2838c9153010,Female,02_Degree,20.0,Right handed,United Kingdom,White,Student


In [123]:
df_dem.drop_duplicates(subset = ['Sex', 'Residence', 'user_id'], keep = False, inplace=True)
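To see how the two `keep` settings behave, here is a small sketch on an invented toy frame (ids and values are assumptions for illustration). User `a1` is fully duplicated, while user `b2` has two rows that conflict on `Education`:

```python
import pandas as pd

# Toy questionnaire data with made-up ids.
toy = pd.DataFrame({
    "user_id": ["a1", "a1", "b2", "b2", "c3"],
    "Sex": ["Male", "Male", "Female", "Female", "Male"],
    "Residence": ["UK", "UK", "UK", "UK", "Other"],
    "Education": ["01_School", "01_School", "02_Degree", "01_School", "02_Degree"],
})

# Step 1: fully duplicated rows; keep="last" retains the second copy of a1.
step1 = toy.drop_duplicates(keep="last")

# Step 2: rows that still share user_id, Sex and Residence must differ in
# another column, so keep=False removes every such row (both b2 rows).
step2 = step1.drop_duplicates(subset=["user_id", "Sex", "Residence"], keep=False)
```

The order matters: removing the exact duplicates first ensures that step 2 only discards participants with contradictory answers, not participants who were simply recorded twice.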

6. The Residence variable includes many different countries. Check how many people are from the United Kingdom and how many are from other countries. As you can see, there aren't many people from each of the other individual countries. Replace the values in that column with UK for all people from the United Kingdom, and Other for all the other countries.

In [124]:
df_dem.Residence.value_counts()

United Kingdom        6651
United States          280
Canada                  67
Australia               41
Ireland {Republic}      32
                      ... 
Benin                    1
Antigua & Deps           1
Nigeria                  1
Lithuania                1
Albania                  1
Name: Residence, Length: 85, dtype: int64

In [125]:
to_change = df_dem["Residence"] == 'United Kingdom'
# Assign through .loc on df_dem itself; chained indexing (df_dem.Residence[...] = ...) may act on a copy
df_dem.loc[~to_change, "Residence"] = 'Other'


In [129]:
df_dem.Residence = df_dem["Residence"].replace("United Kingdom", "UK")

In [130]:
df_dem.Residence.value_counts()

UK       6651
Other     832
Name: Residence, dtype: int64
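The same recoding can be written as a single expression that involves no indexed assignment. A minimal sketch on a made-up Series (the country values are illustrative):

```python
import pandas as pd

# Toy Residence column with invented entries.
residence = pd.Series(
    ["United Kingdom", "Canada", "United Kingdom", "Benin"], name="Residence"
)

is_uk = residence == "United Kingdom"

# Series.where keeps values where the condition is True and replaces the rest;
# the remaining "United Kingdom" entries are then shortened to "UK".
recoded = residence.where(is_uk, "Other").replace("United Kingdom", "UK")

counts = recoded.value_counts()
```

Because `where` and `replace` both return new objects, the result would be assigned back to the dataframe column in one step.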

7. Until now, you have worked with two separate dataframes. To run the next steps of the analysis, you need to combine them. Merge the two dataframes based on the values in user_id.

In [134]:
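# user_id is both the index name and a column; drop the column so that merging on "user_id" is unambiguous
df_dem.drop(["user_id"], axis = 1, inplace=True)
df_cog.drop(["user_id"], axis = 1, inplace=True)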

In [136]:
df_merged = pd.merge(df_cog, df_dem, on = "user_id")
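Since `user_id` is the index of both dataframes at this point, the merge can also be expressed explicitly as an index-on-index join. A sketch with invented toy frames (ids and values are assumptions for illustration):

```python
import pandas as pd

# Toy cognitive and demographic frames, both indexed by a made-up user_id.
cog = pd.DataFrame(
    {"SummaryScore_Blocks": [11.0, 8.0, 14.0]},
    index=pd.Index(["a1", "b2", "c3"], name="user_id"),
)
dem = pd.DataFrame(
    {"Age": [19, 66]},
    index=pd.Index(["a1", "c3"], name="user_id"),
)

# how="inner" (the default) keeps only users present in both frames,
# so b2, who has no demographics after cleaning, is left out.
merged = pd.merge(cog, dem, left_index=True, right_index=True, how="inner")
```

It is worth comparing the shape of the merged dataframe with the shapes of the two inputs, since an inner merge silently drops participants that were removed from either side during cleaning.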

## Data Analysis

Let's complete some descriptive statistics on the data. Check: