## Importing Dataset

In [4]:
import pandas as pd

df = pd.read_stata("YLT.dta")

# Preview the first few rows
print(df.head())


   respondentID  year    rsex yearsni                       placeliv  \
0             1  2012    Male      16           A small city or town   
1             2  2012    Male      16              A country village   
2             3  2012  Female      16           A small city or town   
3             4  2012    Male      16  A farm or home in the country   
4             5  2012    Male      16           A small city or town   

                          thisoct  \
0  At school or college full time   
1  At school or college full time   
2  At school or college full time   
3  At school or college full time   
4  At school or college full time   

                                           oct2yrs   typeschl  \
0  At college or university, and working part time    Grammar   
1                                Working full time  Secondary   
2         Going to college or university full time  Secondary   
3         Going to college or university full time    Grammar   
4                  

### Data Cleaning

#### We clean the data step by step and document each step, so that the process stays transparent and tied to the study’s objective. Because the research examines the role of gender in the importance of religious identity, cases with missing gender information are excluded from the analysis.

In [5]:
# Filter the DataFrame to include only 'Male' and 'Female' genders
df_clean = df[df['rsex'].isin(['Male', 'Female'])]

# Check the result
print(df_clean['rsex'].value_counts())

rsex
Female     5373
Male       3753
Missing       0
Name: count, dtype: int64
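The `Missing       0` row in the output above appears because `rsex` is a pandas categorical, and `value_counts()` lists every declared category even when no rows remain in it. A minimal sketch of dropping the empty category with `cat.remove_unused_categories()`, using a small made-up DataFrame (the `toy` frame and its values are illustrative, not YLT data):

```python
import pandas as pd

# Toy stand-in for the survey data (values are illustrative, not from YLT)
toy = pd.DataFrame({
    "rsex": pd.Categorical(
        ["Male", "Female", "Missing", "Female"],
        categories=["Male", "Female", "Missing"],
    )
})

# Same filter as in the notebook; .copy() avoids modifying a view of `toy`
toy_clean = toy[toy["rsex"].isin(["Male", "Female"])].copy()

# Drop categories that no longer have any rows (here: "Missing")
toy_clean["rsex"] = toy_clean["rsex"].cat.remove_unused_categories()

counts = toy_clean["rsex"].value_counts()
print(counts)
```

Whether to drop the unused category is a presentation choice. It only affects how counts and later tables are displayed, not which rows are kept.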


#### To align with the study’s focus on 16-year-olds in Northern Ireland, we keep only respondents recorded as 16 in the "yearsni" column. All other values, including "Missing", are excluded to maintain consistency and relevance to the research aim. We first inspect the distribution of values and then apply the filter.

In [6]:
# Count how many respondents are in each age group
print(df['yearsni'].value_counts())


yearsni
16         8359
15           98
14           87
Missing      79
10           74
13           63
6            53
12           53
11           52
9            41
5            39
4            38
8            35
7            30
3            28
2            21
1            16
0             2
Name: count, dtype: int64


In [7]:
# Keep only the rows of df_clean where yearsni == 16
df_clean = df_clean[df_clean['yearsni'] == 16]
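To keep each cleaning step documented, it can help to record how many rows a filter removes. The sketch below does this on made-up data (the `toy` frame and its values are illustrative, not YLT data). It also assumes, as the value counts above suggest, that the column mixes integer labels with a "Missing" label, and shows that comparing such a categorical with the integer 16 still selects the intended rows:

```python
import pandas as pd

# Illustrative data only; "yearsni" mixes integer labels with a "Missing" label
toy = pd.DataFrame({
    "yearsni": pd.Categorical(
        [16, 15, "Missing", 16, 16],
        categories=[16, 15, "Missing"],
    )
})

rows_before = len(toy)
toy_16 = toy[toy["yearsni"] == 16]  # keep only rows labelled 16
rows_after = len(toy_16)

print(f"Rows before: {rows_before}, after: {rows_after}, "
      f"dropped: {rows_before - rows_after}")
```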


In [8]:
# Display unique values in the 'relidimp' column
print(df['relidimp'].unique())


['Quite important', 'Not at all important', 'Neither important nor unimportant', 'Very important', 'I don't have a religious identity', 'Not very important', 'Don't know', 'Missing', 'Not asked']
Categories (9, object): ['Very important' < 'Quite important' < 'Neither important nor unimportant' < 'Not very important' ... 'I don't have a religious identity' < 'Don't know' < 'Missing' < 'Not asked']


#### In cleaning the "relidimp" (importance of religious identity) column, we chose to **exclude** responses labeled **"Not asked"** since these represent cases where the question was not presented to the respondent due to survey design. As such, they do not reflect a participant's choice or opinion. However, we **retained** responses marked as **"Missing"** because non-response can carry meaningful insights. For example, if one gender is more likely to skip the question, this may reflect differences in comfort, relevance, or identification with religious identity. Including these entries allows us to explore potential gendered patterns in disengagement or uncertainty related to religious identity.
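One way to look at gendered non-response is to compare the share of "Missing" answers by gender. A minimal sketch on made-up data (the `toy` frame, its values, and the `skipped` column name are illustrative assumptions, not YLT results):

```python
import pandas as pd

# Illustrative responses only; not taken from the YLT dataset
toy = pd.DataFrame({
    "rsex": ["Male", "Male", "Female", "Female", "Female", "Male"],
    "relidimp": [
        "Missing", "Very important", "Quite important",
        "Missing", "Not very important", "Missing",
    ],
})

# Flag non-response, then take the mean of the flag within each gender
toy["skipped"] = toy["relidimp"] == "Missing"
rates = toy.groupby("rsex")["skipped"].mean()
print(rates)
```

The mean of a boolean column is the proportion of `True` values, so `rates` holds the non-response rate for each gender in the toy data.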


In [9]:
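# Drop 'Not asked' responses, building on the earlier gender and age filters
# (filtering df here instead of df_clean would discard those earlier steps)
df_clean = df_clean[df_clean['relidimp'] != 'Not asked']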


In [10]:
# Save the cleaned DataFrame 
df_clean.to_stata("YLTclean.dta", write_index=False)


/tmp/ipykernel_6008/3026883381.py:2: ValueLabelTypeMismatch: 
Stata value labels (pandas categories) must be strings. Column yearsni contains
non-string labels which will be converted to strings.  Please check that the
Stata data file created has not lost information due to duplicate labels.

  df_clean.to_stata("YLTclean.dta", write_index=False)
/tmp/ipykernel_6008/3026883381.py:2: ValueLabelTypeMismatch: 
Stata value labels (pandas categories) must be strings. Column ownreligion contains
non-string labels which will be converted to strings.  Please check that the
Stata data file created has not lost information due to duplicate labels.

  df_clean.to_stata("YLTclean.dta", write_index=False)
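The warnings above say that the `yearsni` and `ownreligion` categoricals contain non-string labels, which pandas converts to strings on export. One option is to convert the labels explicitly before calling `to_stata`, so the conversion is visible and can be checked. A minimal sketch on made-up data (the `toy` frame is illustrative; applying the same step to `df_clean` is an assumption about what suits this dataset):

```python
import pandas as pd

# Illustrative categorical with mixed integer and string labels
toy = pd.DataFrame({
    "yearsni": pd.Categorical(
        [16, 15, "Missing", 16],
        categories=[16, 15, "Missing"],
    )
})

# Rename every category label to its string form, e.g. 16 -> "16"
toy["yearsni"] = toy["yearsni"].cat.rename_categories(str)

labels = toy["yearsni"].cat.categories.tolist()
print(labels)
```

After this step a comparison such as `== 16` no longer matches any row, because the label is now the string "16". Any numeric filter should therefore run before the conversion, or compare against the string label instead.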
