We will use pandas for the exploratory data analysis (EDA).

In [2]:
import pandas as pd

The following cell loads the dataset and shows its first five rows to give us an overview of the data.

In [3]:
data = pd.read_csv("atlantis_citizens_final.csv")
data.head()

Unnamed: 0,Citizen_ID,Diet_Type,District_Name,Occupation,Wealth_Index,House_Size_sq_ft,Life_Expectancy,Vehicle_Owned,Work_District,Bio_Hash
0,CIT_15935,Exotic Imports,Coral Slums,Scribe,1491.0,100.0,42.0,Fin Bicycle,Mariana Plaza,b81cb8ce
1,CIT_11623,Seafood,Coral Slums,Fisher,1596.0,100.0,49.0,Sea Scooter,Deep Trench,72f48eef
2,CIT_8026,Seafood,Mariana Plaza,Warrior,3921.0,533.0,37.0,Sea Scooter,Deep Trench,0abde296
3,CIT_0492,Exotic Imports,Deep Trench,Fisher,,136.0,38.0,Fin Bicycle,Deep Trench,8055fc9e
4,CIT_0275,Seaweed,Deep Trench,Warrior,25985.0,2673.0,54.0,Sea Scooter,Deep Trench,c77829e2


This cell gives us the dimensions of the data (rows, columns).

In [4]:
data.shape

(15751, 10)

This cell lists each column with its data type and its number of non-null values.

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15751 entries, 0 to 15750
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Citizen_ID        15751 non-null  object 
 1   Diet_Type         15751 non-null  object 
 2   District_Name     15751 non-null  object 
 3   Occupation        15751 non-null  object 
 4   Wealth_Index      14696 non-null  float64
 5   House_Size_sq_ft  14554 non-null  float64
 6   Life_Expectancy   15137 non-null  float64
 7   Vehicle_Owned     15751 non-null  object 
 8   Work_District     15751 non-null  object 
 9   Bio_Hash          15751 non-null  object 
dtypes: float64(3), object(7)
memory usage: 1.2+ MB


We see that only Wealth_Index, House_Size_sq_ft and Life_Expectancy contain null values; every other column is complete. House_Size_sq_ft has the highest number of null values.

In [6]:
data.isnull().sum()

Citizen_ID             0
Diet_Type              0
District_Name          0
Occupation             0
Wealth_Index        1055
House_Size_sq_ft    1197
Life_Expectancy      614
Vehicle_Owned          0
Work_District          0
Bio_Hash               0
dtype: int64
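
Raw counts are easier to compare as a share of all rows. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are illustrative assumptions, not rows from the dataset); the same expression should work on `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration only.
toy = pd.DataFrame({
    "Wealth_Index": [1500.0, np.nan, 3900.0, 2600.0],
    "House_Size_sq_ft": [100.0, 136.0, np.nan, np.nan],
    "Occupation": ["Scribe", "Fisher", "Warrior", "Miner"],
})

# isnull().mean() is the fraction of missing values per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

For the counts shown above, dividing by the 15751 rows gives roughly 7.6% missing for House_Size_sq_ft, 6.7% for Wealth_Index and 3.9% for Life_Expectancy.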

We will now check whether the data contains duplicate individuals.

In [7]:
data["Bio_Hash"].duplicated().sum()
# We use Bio_Hash instead of Citizen_ID because one person could hold several citizen IDs (for example a fake ID), while the biological hash is assumed to be unique to an individual.

np.int64(0)

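np.int64(0) simply means 0. The result is a NumPy 64-bit integer (np is the conventional alias for NumPy), which is the type pandas returns here.
So no Bio_Hash value is repeated, and we don't have any duplicate individuals in our dataset.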
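
To make the difference between the two identifiers concrete, here is a small sketch on invented rows (`toy`, the IDs and the hashes are assumptions for illustration). It counts repeats per candidate key and pulls out the rows that share a hash so they can be inspected.

```python
import pandas as pd

# Toy stand-in for `data`; IDs and hashes are invented for illustration only.
toy = pd.DataFrame({
    "Citizen_ID": ["CIT_1", "CIT_2", "CIT_3", "CIT_4"],
    "Bio_Hash": ["aaa111", "bbb222", "aaa111", "ccc333"],
})

# duplicated() flags every repeat after the first occurrence of a value.
dup_counts = {col: int(toy[col].duplicated().sum()) for col in ["Citizen_ID", "Bio_Hash"]}

# keep=False flags all rows that share a Bio_Hash, including the first one.
repeated_rows = toy[toy["Bio_Hash"].duplicated(keep=False)]

print(dup_counts)
print(repeated_rows)
```

In this toy table the citizen IDs are all distinct, yet two rows share a hash, which is exactly the case a check on Citizen_ID alone would miss.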

We will separate the columns into numerical and object (non-numerical) columns to make the analysis easier.

In [8]:
num_cols = data.select_dtypes(include=["int64", "float64"]).columns
cat_cols = data.select_dtypes(include=["object"]).columns

num_cols, cat_cols

(Index(['Wealth_Index', 'House_Size_sq_ft', 'Life_Expectancy'], dtype='object'),
 Index(['Citizen_ID', 'Diet_Type', 'District_Name', 'Occupation',
        'Vehicle_Owned', 'Work_District', 'Bio_Hash'],
       dtype='object'))

Now we will perform univariate analysis on the numerical columns

In [9]:
data[num_cols].describe()

Unnamed: 0,Wealth_Index,House_Size_sq_ft,Life_Expectancy
count,14696.0,14554.0,15137.0
mean,9529.708628,1417.153772,51.913391
std,20502.182375,2233.194323,16.567063
min,1000.0,100.0,20.0
25%,1978.0,205.0,40.0
50%,3794.5,477.0,49.0
75%,8951.5,1470.75,60.0
max,589377.0,10000.0,110.0


We see that Wealth_Index and House_Size_sq_ft are heavily positively skewed: in both columns the mean is far above the median (9529.7 vs 3794.5 and 1417.2 vs 477.0). For Life_Expectancy the mean (51.9) is close to the median (49.0), hinting at a roughly symmetric distribution with only mild positive skew.
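
The mean-versus-median comparison can be backed by a skewness coefficient. The sketch below uses invented numbers (`toy` is an assumption, not the real data) and pandas' built-in `skew()`, which returns a sample skewness that is positive for a long right tail and near zero for symmetric data.

```python
import pandas as pd

# Toy stand-in for data[num_cols]; the values are invented for illustration only.
toy = pd.DataFrame({
    "Wealth_Index": [1000.0, 1500.0, 2000.0, 3000.0, 50000.0],  # one very large value
    "Life_Expectancy": [40.0, 45.0, 50.0, 55.0, 60.0],          # symmetric around 50
})

# Put mean, median and skewness side by side, one row per column.
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary)
```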

The following cell identifies the outliers in our numerical columns using the IQR method: a value is flagged if it lies below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.

In [10]:
def iqr_outliers(df, col, k=1.5):
    """Return the rows of df whose value in col lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    return df[(df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)]

# Apply the same rule to each numerical column
outliers_w_i = iqr_outliers(data, "Wealth_Index")
outliers_h_s = iqr_outliers(data, "House_Size_sq_ft")
outliers_l = iqr_outliers(data, "Life_Expectancy")

outliers_w_i, outliers_h_s, outliers_l

(      Citizen_ID       Diet_Type    District_Name Occupation  Wealth_Index  \
 4       CIT_0275         Seaweed      Deep Trench    Warrior       25985.0   
 24     CIT_19196         Seafood  The Golden Reef      Miner       53282.0   
 29      CIT_8801         Seaweed    Mariana Plaza    Warrior       64022.0   
 39      CIT_6536         Seafood  The Golden Reef   Merchant       66593.0   
 49      CIT_3220         Seafood  The Golden Reef   Merchant       57368.0   
 ...          ...             ...              ...        ...           ...   
 15694   CIT_8495         Seaweed  The Golden Reef     Scribe      589377.0   
 15697  CIT_17067         Seafood  The Golden Reef      Miner       31314.0   
 15698   CIT_9396         Seafood  The Golden Reef      Miner       19541.0   
 15706   CIT_0957  Exotic Imports  The Golden Reef   Merchant       73174.0   
 15736   CIT_1192         Seaweed    Mariana Plaza    Warrior       39617.0   
 
        House_Size_sq_ft  Life_Expectancy    Vehic
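
Printing whole DataFrames of outliers is hard to read, so a compact summary of the fences and the number of flagged rows per column may be more useful. This is a minimal sketch on invented values; `iqr_bounds`, `toy` and `outlier_summary` are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR; quantile() skips NaN."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy stand-in for data[num_cols]; the values are invented for illustration only.
toy = pd.DataFrame({
    "Wealth_Index": [1000.0, 1200.0, 1400.0, 1600.0, 1800.0, 2000.0, 2200.0, 2400.0, 90000.0],
    "Life_Expectancy": [40.0, 42.0, 44.0, 46.0, 48.0, 50.0, 52.0, 54.0, 56.0],
})

rows = []
for col in toy.columns:
    lower, upper = iqr_bounds(toy[col])
    # Count the values that fall outside the fences.
    n_out = int(((toy[col] < lower) | (toy[col] > upper)).sum())
    rows.append({"column": col, "lower": lower, "upper": upper, "n_outliers": n_out})

outlier_summary = pd.DataFrame(rows).set_index("column")
print(outlier_summary)
```

On the real data the loop would run over `num_cols` and `data` instead of `toy`.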

Next, we will analyze the categorical columns.

In [11]:
data["Occupation"].value_counts()  # frequency of occupations (most and least common occupations)

Occupation
Merchant    3535
Warrior     3531
Fisher      3136
Miner       3028
Scribe      2521
Name: count, dtype: int64

In [12]:
data["District_Name"].value_counts() # frequency of district (most and least populated district)

District_Name
The Golden Reef    4811
Deep Trench        4696
Coral Slums        3126
Mariana Plaza      3118
Name: count, dtype: int64

From the output of the above two cells we see that:

Most common occupation = Merchant

Least common occupation = Scribe

Most populated district = The Golden Reef

Least populated district = Mariana Plaza (3118 residents, just below Coral Slums with 3126)

We will also compare the average wealth per district to get an idea of how wealth is distributed among the different areas.

In [13]:
data.groupby("District_Name")["Wealth_Index"].mean()

District_Name
Coral Slums         3371.414207
Deep Trench         4806.795825
Mariana Plaza       8535.553804
The Golden Reef    18726.909656
Name: Wealth_Index, dtype: float64

The result is quite interesting. Coral Slums and Deep Trench have much lower average wealth than the other two districts, with Coral Slums the lowest. The Golden Reef has the highest average wealth, more than double that of Mariana Plaza, which ranks second.

This shows high economic inequality between the districts: the average wealth index of The Golden Reef is about 5.6 times that of Coral Slums.
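
Because Wealth_Index is heavily skewed, a few very rich residents can pull a district's mean upward, so it may help to look at the median next to the mean. The sketch below uses invented rows (`toy` and the district labels "A" and "B" are assumptions for illustration) to show how the two ratios can differ.

```python
import pandas as pd

# Toy stand-in for `data`; districts and values are invented for illustration only.
toy = pd.DataFrame({
    "District_Name": ["A", "A", "A", "B", "B", "B"],
    "Wealth_Index": [1000.0, 2000.0, 3000.0, 2000.0, 4000.0, 60000.0],
})

# Mean, median and group size per district in one table.
by_district = toy.groupby("District_Name")["Wealth_Index"].agg(["mean", "median", "count"])

# Ratio of the richest to the poorest district under each measure.
mean_ratio = by_district["mean"].max() / by_district["mean"].min()
median_ratio = by_district["median"].max() / by_district["median"].min()

print(by_district)
print(mean_ratio, median_ratio)
```

In this toy example the single value of 60000 makes district B look 11 times richer by mean but only 2 times richer by median.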

We will also check the life expectancy among the districts

In [14]:
data.groupby("District_Name")["Life_Expectancy"].describe()

District_Name,count,mean,std,min,25%,50%,75%,max
Coral Slums,3030.0,44.307591,8.897222,20.0,38.0,44.0,50.0,83.0
Deep Trench,4505.0,43.758047,10.453507,20.0,36.0,42.0,50.0,94.0
Mariana Plaza,2996.0,51.851469,13.242605,20.0,42.0,51.0,60.0,110.0
The Golden Reef,4606.0,64.933565,18.879662,20.0,51.0,62.0,75.0,110.0


This shows that The Golden Reef has the highest average life expectancy (about 64.9) while Deep Trench has the lowest (about 43.8), a gap of roughly 21 years.

Considering this and the previous cell's output, The Golden Reef ranks first among the districts on both average wealth and average life expectancy.

One more interesting point is the spread. The Golden Reef and Mariana Plaza have the widest range of life expectancy (from 20 to 110), while Coral Slums has the narrowest (from 20 to 83). The Golden Reef also has the largest standard deviation (about 18.9), so even though its average is high, the health inequality within the district is quite large.

We can also see how occupations are distributed within each district.

In [15]:
data.groupby(["District_Name", "Occupation"])["Occupation"].count()

District_Name    Occupation
Coral Slums      Fisher         925
                 Merchant       293
                 Miner         1275
                 Scribe         308
                 Warrior        325
Deep Trench      Fisher        1419
                 Merchant       516
                 Miner          458
                 Scribe         486
                 Warrior       1817
Mariana Plaza    Fisher         314
                 Merchant       306
                 Miner          324
                 Scribe        1259
                 Warrior        915
The Golden Reef  Fisher         478
                 Merchant      2420
                 Miner          971
                 Scribe         468
                 Warrior        474
Name: Occupation, dtype: int64

From the above output we can see that each district is dominated by one or two occupations: Miners and Fishers in Coral Slums, Warriors and Fishers in Deep Trench, Scribes and Warriors in Mariana Plaza, and Merchants in The Golden Reef. The Golden Reef has by far the most Merchants (2420), a possible reason for its higher average wealth index.
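
Since the districts differ in size, shares are easier to compare than raw counts. A minimal sketch with `pd.crosstab` and `normalize="index"`, on invented rows (`toy` and the district names "North" and "South" are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for `data`; districts and rows are invented for illustration only.
toy = pd.DataFrame({
    "District_Name": ["North", "North", "North", "North", "South", "South", "South"],
    "Occupation": ["Miner", "Miner", "Miner", "Fisher", "Merchant", "Merchant", "Fisher"],
})

# normalize="index" divides each row by its total, so every district sums to 1.
shares = pd.crosstab(toy["District_Name"], toy["Occupation"], normalize="index")

# The occupation with the largest share in each district.
dominant = shares.idxmax(axis=1)

print(shares.round(2))
print(dominant)
```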

Let's now analyze average wealth per occupation

In [16]:
data.groupby("Occupation")["Wealth_Index"].mean()

Occupation
Fisher       4263.595556
Merchant    18780.172007
Miner        7454.654064
Scribe       8400.855990
Warrior      7496.336078
Name: Wealth_Index, dtype: float64

This shows that Merchant is the wealthiest occupation on average while Fisher is the poorest. The gap between them is huge: the average Merchant wealth index is about 4.4 times that of Fisher.

Now we will compare the occupations by their median life expectancy.

In [25]:
data.groupby("Occupation")["Life_Expectancy"].median()

Occupation
Fisher      43.0
Merchant    65.0
Miner       46.0
Scribe      58.0
Warrior     39.0
Name: Life_Expectancy, dtype: float64

From above, we can see that Merchants have the highest median life expectancy (65) while Warriors have the lowest (39).


Let's also check whether any individual is engaged in more than one occupation.

In [18]:
(data.groupby("Bio_Hash")["Occupation"].nunique() > 1).any()

np.False_

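The output is False, meaning there is no one who is engaged in more than one occupation. This is expected here: every Bio_Hash appears exactly once, so each individual has a single row and therefore a single occupation.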

We will now move from univariate analysis to bivariate analysis to see how different columns (entities) are related to one another.

First, we will look at the relation between money (Wealth_Index) and life expectancy.

In [19]:
correlation_wealth_life = data["Wealth_Index"].corr(data["Life_Expectancy"])
correlation_wealth_life

np.float64(0.5880947579165877)

The correlation between wealth and life expectancy is approximately 0.59 (Pearson, the default for corr(), computed on rows where both values are present), indicating a moderate positive linear relationship. In general, wealthier people of Atlantis live longer than poorer ones, but this is not always true; other factors such as work district and occupation also appear to play a role.
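
Pearson correlation measures linear association and can be distorted by extreme values, which matters for a skewed column such as Wealth_Index. A rank-based (Spearman) correlation is a common complement. The sketch below computes it as the Pearson correlation of the ranks, on invented values (`toy`, `pair`, `pearson` and `spearman` are hypothetical names, not part of the notebook).

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; the values are invented for illustration only.
toy = pd.DataFrame({
    "Wealth_Index": [1000.0, 2000.0, 4000.0, 8000.0, np.nan, 500000.0],
    "Life_Expectancy": [40.0, 45.0, 50.0, 55.0, 60.0, 65.0],
})

# Keep only rows where both values are present, so the ranks line up.
pair = toy[["Wealth_Index", "Life_Expectancy"]].dropna()

# Pearson on the raw values (the default method of Series.corr).
pearson = pair["Wealth_Index"].corr(pair["Life_Expectancy"])

# Spearman is Pearson applied to the ranks of the values.
spearman = pair["Wealth_Index"].rank().corr(pair["Life_Expectancy"].rank())

print(pearson, spearman)
```

In this toy example the relationship is perfectly monotonic, so the rank correlation is 1 while the single extreme wealth value pulls the Pearson coefficient below 1.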

Let's also check how house size relates to life expectancy.

In [20]:
correlation_housing_life = data["House_Size_sq_ft"].corr(data["Life_Expectancy"])
correlation_housing_life

np.float64(0.7977717844528226)

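The correlation between house size and life expectancy is almost 0.8, well above the 0.59 between wealth and life expectancy. House size is therefore more strongly linearly associated with life expectancy than wealth is. Correlation alone does not show that a larger house causes a longer life, and the heavy skew of Wealth_Index may weaken its Pearson correlation.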
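
When comparing two predictors of the same outcome, it can help to see all pairwise correlations at once, including the one between the predictors themselves. The sketch below uses `DataFrame.corr()` on invented values (`toy` is an assumption for illustration; the notebook itself does not compute the wealth and house size correlation).

```python
import numpy as np
import pandas as pd

# Toy stand-in for data[num_cols]; the values are invented for illustration only.
toy = pd.DataFrame({
    "Wealth_Index": [1000.0, 3000.0, 2000.0, 9000.0, 5000.0],
    "House_Size_sq_ft": [100.0, 300.0, np.nan, 900.0, 500.0],
    "Life_Expectancy": [40.0, 48.0, 45.0, 70.0, 52.0],
})

# Pairwise Pearson correlations; each pair uses only rows where both values exist.
corr_matrix = toy.corr()
print(corr_matrix.round(2))
```

If two predictors turn out to be strongly correlated with each other, their separate correlations with life expectancy cannot be read as independent effects.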

Now we will check the effect of wealth on an individual's diet.

In [21]:
data.groupby("Diet_Type")["Wealth_Index"].mean()

Diet_Type
Exotic Imports    10366.435328
Seafood            9966.829105
Seaweed            8261.324093
Name: Wealth_Index, dtype: float64

As we can see from the output, people who eat Exotic Imports are the wealthiest on average and those who eat Seaweed the least wealthy. The difference is modest given how widely wealth varies (standard deviation of about 20,500), which suggests that diet depends only weakly on wealth and more on other factors, possibly personal preference or nutritional requirements.

Let's also see the relation between diet and occupation.

In [22]:
data.groupby(["Diet_Type", "Occupation"])["Occupation"].count()

Diet_Type       Occupation
Exotic Imports  Fisher        1057
                Merchant      1198
                Miner          999
                Scribe         830
                Warrior       1125
Seafood         Fisher        1029
                Merchant      1172
                Miner         1004
                Scribe         860
                Warrior       1217
Seaweed         Fisher        1050
                Merchant      1165
                Miner         1025
                Scribe         831
                Warrior       1189
Name: Occupation, dtype: int64

The counts are similar across all 3 diet types within each occupation, so no occupation shows a marked preference for a specific diet.

Now let's see the effect of diet on life expectancy

In [23]:
data.groupby("Diet_Type")["Life_Expectancy"].mean()

Diet_Type
Exotic Imports    46.862083
Seafood           53.395899
Seaweed           55.420387
Name: Life_Expectancy, dtype: float64

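Mean life expectancy ranges from about 46.9 (Exotic Imports) to about 55.4 (Seaweed), a gap of roughly 8.6 years. This is smaller than the gaps seen across districts or occupations, so diet type appears to be a weaker factor for life expectancy, although the difference is not negligible.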

Finally, let's see how many people in each district commute to work in another district.

In [24]:
commuters = data[data["District_Name"] != data["Work_District"]]

commuters_per_district = (
    commuters.groupby("District_Name")["Bio_Hash"]
    .nunique()
    .reset_index(name="Commuting_Out_Count")
)

commuters_per_district

Unnamed: 0,District_Name,Commuting_Out_Count
0,Coral Slums,2078
1,Deep Trench,2971
2,Mariana Plaza,1972
3,The Golden Reef,3506


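As we can see, every district has a lot of commuters. The absolute count is highest for The Golden Reef and lowest for Mariana Plaza. Relative to the resident counts shown earlier, roughly 63% to 73% of each district's residents work in a different district.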
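
Absolute counts favor large districts, so the share of residents who commute out is a fairer comparison. A minimal sketch on invented rows (`toy`, `is_commuter`, `commute_share` and the district labels "A" and "B" are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for `data`; districts and rows are invented for illustration only.
toy = pd.DataFrame({
    "District_Name": ["A", "A", "A", "A", "B", "B"],
    "Work_District": ["A", "B", "B", "B", "B", "A"],
})

# True when a resident works outside the home district.
is_commuter = toy["District_Name"] != toy["Work_District"]

# The mean of a boolean Series is the fraction of True values per group.
commute_share = is_commuter.groupby(toy["District_Name"]).mean()
print(commute_share)
```

This relies on one row per individual, which holds for the dataset because no Bio_Hash is repeated.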

Main Takeaways

1. The dataset enables reliable person-level analysis through the use of Bio_Hash, which uniquely and anonymously identifies individuals while preserving data integrity.

2. Significant district-wise variation in commuting behavior is observed, indicating a mismatch between residential areas and employment locations.

3. A moderate positive correlation (r = 0.59) exists between Wealth Index and Life Expectancy, suggesting that economic status is an important but not exclusive determinant of longevity. A relatively high correlation (r = 0.80) also exists between house size and life expectancy, which is rather surprising.

4. Clear economic and occupational disparities are evident across districts, reflected in differences in wealth, occupation mix, and life expectancy.

5. Financial standing appears to influence lifestyle outcomes, with wealthier districts and occupations generally showing longer life expectancy, although the link between wealth and diet type is weak.