# Data Formats Task

### Objective:
Analyze raw data from Group 1 "Student task.txt" and find all the errors present in the file.

### Overview:
The file name of the raw data does not indicate what kind of student data it contains. However, judging from the data fields and some background research, the data most likely comes from a psychological or behavioral research study.

### Summary:
This dataset captures participant responses, reaction times, and error rates across trials in a psychological study assessing implicit associations between age groups (e.g., "Old People" and "Young People") and attributes (e.g., "Competent" and "Incompetent").

In [1]:
import numpy as np
import pandas as pd
raw_psy = pd.read_csv('Student task_Group1.txt', sep='\t')
raw_psy.head()



Unnamed: 0,block_number,block_name,block_trial_count,block_pairing_definition,study_name,task_number,task_name,trial_number,trial_name,trial_response,trial_latency,trial_error,session_id
0,5,BLOCK5,24,"Old People/Incompetent,Young People/Competent",NosekLab.nicolelindner.maxnetuc.0001,8,compoy,9,Independent,Young People/Competent,563,0,888619
1,5,BLOCK5,24,"Old People/Incompetent,Young People/Competent",NosekLab.nicolelindner.maxnetuc.0001,8,compoy,10,om1.jpg,Old People/Incompetent,430,0,888619
2,5,BLOCK5,24,"Old People/Incompetent,Young People/Competent",NosekLab.nicolelindner.maxnetuc.0001,8,compoy,11,Skilled,Young People/Competent,524,0,888619
3,5,BLOCK5,24,"Old People/Incompetent,Young People/Competent",NosekLab.nicolelindner.maxnetuc.0001,8,compoy,12,ym5.jpg,Young People/Competent,487,0,888619
4,5,BLOCK5,24,"Old People/Incompetent,Young People/Competent",NosekLab.nicolelindner.maxnetuc.0001,8,compoy,13,Competent,Young People/Competent,960,0,888619


### 1. Basic sanity checks on the data set:
##### > Data types of the fields and unique identities
##### > Null / NaN / empty-string values
##### > Manual errors -> typos

In [59]:
# Check the data types and unique identities of the fields.
raw_psy.info()
for col in raw_psy.columns:
    print("--------------------------------------------------")
    print(f"{col}: {raw_psy[col].unique()}")

# Check for any NaN or empty string values in each cell
null_values = raw_psy.isnull() | (raw_psy == '')

# Identify the rows and columns where these null values are located
nan_locations = [(index, col) for index, row in null_values.iterrows() for col, is_null in row.items() if is_null]

if nan_locations:
    print("________________________________________________________________________________________________________")
    print("Null values found at:")
    for index, col in nan_locations:
        print(f"Index: {index}, Column: {col}")
else:
    print("No null or empty values found in the DataFrame.")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144552 entries, 0 to 144551
Data columns (total 13 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   block_number               144552 non-null  int64 
 1    block_name                144551 non-null  object
 2    block_trial_count         144552 non-null  object
 3    block_pairing_definition  144552 non-null  object
 4    study_name                144552 non-null  object
 5    task_number               144551 non-null  object
 6    task_name                 144552 non-null  object
 7    trial_number              144552 non-null  int64 
 8    trial_name                144552 non-null  object
 9    trial_response            144552 non-null  object
 10   trial_latency             144552 non-null  int64 
 11   trial_error               144552 non-null  int64 
 12   session_id                144552 non-null  int64 
dtypes: int64(5), object(8)
memory usage: 14.3+ MB
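
The row-by-row loop in the cell above can be slow on roughly 144,000 rows. As a minimal sketch of a vectorized alternative, the helper below builds a boolean mask and uses `stack()` to list the affected cells. The function name `find_missing_cells` and the toy frame `demo` are hypothetical and not part of the original notebook; the sketch also assumes that whitespace-only strings should count as missing.

```python
import pandas as pd


def find_missing_cells(df: pd.DataFrame) -> list:
    """Return (row_index, column) pairs for cells that are NaN or blank strings."""
    # Compare every cell as text so whitespace-only entries are caught as well.
    blank = (
        df.astype("string")
        .apply(lambda s: s.str.strip() == "")
        .fillna(False)
        .astype(bool)
    )
    mask = df.isna() | blank
    # stack() reshapes the boolean frame into a Series indexed by (row, column).
    stacked = mask.stack()
    return list(stacked[stacked].index)


# Toy frame with made-up values, only to illustrate the helper.
demo = pd.DataFrame(
    {
        "block_name": ["BLOCK5", None, "  "],
        "task_number": ["8", "a8", "8"],
    }
)
print(find_missing_cells(demo))
```

On the real data this would be called as `find_missing_cells(raw_psy)`.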

### 2. Duplicate records


In [65]:
print(f"Number of duplicated rows: {raw_psy.duplicated().sum()}")

Number of duplicated rows: 0


###### There are no duplicates.

### 3. Data type Corrections
While checking the data types of the fields, we observe that a few fields stored as int64 or object can be cast to string, since they are fixed identifiers. The exceptions are fields whose values may be altered in the future; those are kept numeric.
   


In [118]:
raw_psy.columns = raw_psy.columns.str.strip()
raw_psy = raw_psy.astype("string")
# Convert block_trial_count and trial_latency back to integers, since these values may be altered later.
# Note: errors='coerce' followed by fillna(0) turns unparseable entries (e.g. 'km24') into 0.
raw_psy['block_trial_count'] = pd.to_numeric(raw_psy['block_trial_count'], errors='coerce').fillna(0).astype(int)
raw_psy['trial_latency'] = pd.to_numeric(raw_psy['trial_latency'], errors='coerce').fillna(0).astype(int)
raw_psy.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144552 entries, 0 to 144551
Data columns (total 13 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   block_number              144552 non-null  string
 1   block_name                144552 non-null  string
 2   block_trial_count         144552 non-null  int32 
 3   block_pairing_definition  144552 non-null  string
 4   study_name                144552 non-null  string
 5   task_number               144552 non-null  string
 6   task_name                 144552 non-null  string
 7   trial_number              144552 non-null  string
 8   trial_name                144552 non-null  string
 9   trial_response            144552 non-null  string
 10  trial_latency             144552 non-null  int32 
 11  trial_error               144552 non-null  string
 12  session_id                144552 non-null  string
dtypes: int32(2), string(11)
memory usage: 13.2 MB
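
Because the conversion cell fills unparseable entries with 0, they become indistinguishable from a genuine 0 once the cast is done. A minimal sketch of one possible alternative, assuming it is preferable to keep such entries visibly missing, is to use pandas' nullable `Int64` dtype. The helper name `to_nullable_int` and the sample values are hypothetical.

```python
import pandas as pd


def to_nullable_int(series: pd.Series) -> pd.Series:
    """Parse a column as integers, leaving unparseable entries as <NA>."""
    # errors="coerce" maps values such as 'km24' to NaN instead of raising.
    parsed = pd.to_numeric(series, errors="coerce")
    # The nullable Int64 dtype keeps the gaps as <NA> rather than forcing a 0.
    return parsed.astype("Int64")


# Made-up sample that mimics the kind of entry seen in block_trial_count.
demo = pd.Series(["24", "km24", "24"])
converted = to_nullable_int(demo)
print(converted)
print("unparseable entries:", converted.isna().sum())
```

Whether a placeholder 0 or an explicit `<NA>` is more appropriate depends on how the column is used downstream.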


##### Mistakes observed after the analysis above:
* block_trial_count - 'km24' is not a number.
* task_number - 'a8' is not a number.
* task_name - 'compoy' instead of 'compony'; a manual error, possibly a typo.
* study_name - incomplete value 'maxnetuc.0001'.
* trial_number - should be a whole number, but the field contains a '-3' value.
* trial_name - contains ".jpg" values that are semantically inconsistent with the other values in the field.
* trial_response - field values look incomplete, although this may be the intended format.
* Null values found in block_name and task_number.
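
The non-numeric and negative entries listed above can also be surfaced programmatically instead of by reading the `unique()` output. The following is a minimal sketch on a made-up frame; the helper `non_numeric_values` and the sample data are hypothetical, and only the column names are taken from the dataset.

```python
import pandas as pd


def non_numeric_values(series: pd.Series) -> pd.Series:
    """Return entries that are present but cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    # Keep values that failed to parse, excluding cells that were already null.
    return series[parsed.isna() & series.notna()]


# Made-up sample reproducing the kinds of errors listed above.
demo = pd.DataFrame(
    {
        "block_trial_count": ["24", "km24", "24"],
        "task_number": ["8", "a8", None],
        "trial_number": ["9", "-3", "11"],
    }
)

for col in ["block_trial_count", "task_number"]:
    print(col, non_numeric_values(demo[col]).tolist())

# Rows whose trial_number is negative.
negative = demo[pd.to_numeric(demo["trial_number"], errors="coerce") < 0]
print(negative.index.tolist())
```

The same checks could be run on `raw_psy` after stripping the column names.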