# Women's Representation in City Property of San Francisco, an Exploration

Welcome! This notebook walks through an initial exploration of a dataset on the gender distribution of monuments in San Francisco. Let's start by loading the packages we will need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Now let's load in our data and get its dimensions.

In [None]:
df = pd.read_csv("../input/women-representation-in-city-property-sanfrancisco/WomenRepresentaionInCityProperty-SanFrancisco.csv")
df.shape

Let's take a look at the important characteristics of each column.

In [None]:
col_chr = pd.DataFrame({"Data Type": df.dtypes,
                        "NA's": df.isnull().sum(),
                        "Unique Values": df.nunique()})
col_chr
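
A summary like this can also be queried programmatically, for example to list the columns that hold only a single value. A minimal sketch on a toy frame (the column names and values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "Name": ["Plaza A", "Park B", "Library C"],
    "Gender": ["M", "F", "M"],
    "District": [1, 1, 1],          # constant column
    "Region": ["SF", "SF", "SF"],   # constant column
})

# A column is treated as constant when it holds exactly one distinct
# non-null value and has no missing entries.
is_constant = (toy.nunique() == 1) & (toy.isnull().sum() == 0)
constant_cols = toy.columns[is_constant.to_numpy()].tolist()
print(constant_cols)
```

Columns flagged this way carry no information for comparisons between rows, so they are natural candidates for dropping.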

Note that each of the final 5 columns takes only a single value, with no NA values. Let's print that value for each of these columns.

In [None]:
df[["Current Police Districts","Current Supervisor Districts",
   "Analysis Neighborhoods","Neighborhoods","SF Find Neighborhoods"]].loc[1]

Now let's rename our columns and drop the uninteresting variables.

In [None]:
df_final = df.drop(df.columns[6:11],axis=1).rename(
    columns={"Department/Source":"dep_source",
             "Name":"name","Person":"person",
             "Gender":"gender","Reference":"ref",
             "Comments":"comments"}) \
             .replace({"gender":{"M & F":"F & M"}})
df_final.head()

Let's look at the columns in the order they appear, starting with the Department/Source column. Since this dataset focuses on gender, we will also break each column down by the gender categories it represents.

In [None]:
col_chr.loc["Department/Source"]

With 8 unique values and no NA's, we can see that the important question is how these observations are distributed.

In [None]:
dep = df_final.dep_source.value_counts()
# Name the count column explicitly; its default name depends on the pandas version
dep = dep.to_frame("dep_source")
plt.ylabel('Department')
plt.xlabel('Monuments/Statues')
plt.title('Distribution of Department Sources in SF Monuments')
plt.barh(dep.index,dep.dep_source)

Now let's check out the gender breakdown for the Department/Source column.

In [None]:
dep_gender2=df_final.replace({"gender":{"M & M":"M"}}) \
                    .groupby(["dep_source","gender"]).size() \
                    .to_frame().reset_index() \
                    .rename(columns={0:"Count","dep_source":"Department/Source"})
plt.xticks(rotation=45)
sns.barplot(x="Department/Source",y="Count",hue="gender",
            data=dep_gender2,
            hue_order=["M","F & M","F"],
           palette=["Red","grey","blue"])
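
The same breakdown can also be read as a table rather than a plot. A minimal sketch using `pd.crosstab` on toy data (the department names below are invented, not taken from the dataset):

```python
import pandas as pd

# Toy rows standing in for df_final (department names are made up)
toy = pd.DataFrame({
    "dep_source": ["Parks", "Parks", "Library", "Library", "Library"],
    "gender": ["M", "F", "M", "M", "F & M"],
})

# Rows: departments, columns: gender categories, cells: monument counts
table = pd.crosstab(toy["dep_source"], toy["gender"])

# Share of each department's monuments per gender category (rows sum to 1)
shares = pd.crosstab(toy["dep_source"], toy["gender"], normalize="index")

print(table)
print(shares.round(2))
```

Row-normalized shares make departments of very different sizes easier to compare than raw counts.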

Now let's look at the "name" column, which describes the location of each monument.

In [None]:
col_chr.loc["Name"]

With 82 unique values, each observation is a unique string. This column would perhaps be useful with a locational dataset to merge with, but on its own doesn't give us much information.

As such, let's continue on to the next column: the "person" column, which denotes who the statue is of.

In [None]:
col_chr.loc["Person"]

Let's take a look at how many times various individuals are honored with a statue. The "person" column denotes who the subject is, so we will create a frequency histogram of that column. However, some statues depict multiple people, so the solution has to take this into account. We begin by listing all the multi-person statues, along with any single-person statues that honor the same individuals.

In [None]:
multiples = df_final.loc[(df_final['gender']=="F & M")|
                         (df_final['gender']=="M & F")|
                         (df_final['gender']=="M & M")] \
                    .person.str.split(" and | & ",expand=True)
multiples=multiples.reset_index()
multiples=multiples[[0,1]]
multiples.columns = ["person_1","person_2"]
# Assign with .loc[row, column] so the corrections are written to the frame
# itself rather than to a temporary copy of the row
multiples.loc[1, "person_1"] = "Minnie Ward"
multiples.loc[2, "person_1"] = "Charlotte Shultz"
multiples.loc[3, "person_1"] = "Walter Hass"
multiples.loc[4, "person_2"] = "Syncip Family"
multiples.loc[5, "person_2"] = "Koret (Family?)"
multiples.loc[6, "person_1"] = "Herman Herbst"
multiples.loc[7, "person_2"] = "Fulton (family?)"
multiples.loc[8, "person_1"] = "Dianne Taube"
multiples.loc[9, "person_1"] = "Thelma Doelger"
# pd.concat replaces Series.append, which was removed in pandas 2.0
multiples=pd.concat([multiples.person_1,multiples.person_2]).reset_index(drop=True)
multiples="|".join(multiples)
df_final[df_final['person'].str.contains(multiples,na=False)]

We have looked through every name in the person column, and it seems that no subject of a multi-person statue is also the subject of a single-person statue. This makes things much easier, but we still have to decide whether to treat each multi-person statue as two separate observations or as a single one. Since many of these multi-person statues depict two partners, or an unknown number of people designated as a family, it is reasonable to treat each multi-person statue as a single observation instead of splitting it into its individual subjects.
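
The overlap check described above can be sketched on toy data. This is a simplified illustration with invented names, assuming every entry is written consistently enough for exact matching; it does not reproduce the manual corrections made in the cell above:

```python
import pandas as pd

# Toy "person" column (names are invented for illustration)
person = pd.Series([
    "Ann Lee and Bob Lee",   # multi-person entry
    "Carl Diaz & Dee Diaz",  # multi-person entry
    "Ann Lee",               # also appears alone
    "Eve Park",
])

# Separate multi-person entries from single-person entries
is_multi = person.str.contains(" and | & ")
singles = set(person[~is_multi])

# Split each multi-person entry into its members, one per row
members = person[is_multi].str.split(" and | & ").explode().str.strip()

# Members of a multi-person entry who also have an entry of their own
overlap = sorted(set(members) & singles)
print(overlap)
```

An empty `overlap` list would correspond to the situation found in this dataset, where no person appears in both kinds of entry.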

So, let us plow ahead and look at the distribution of repeated individuals in San Francisco Monuments.

In [None]:
# Name the count column explicitly; its default name depends on the pandas version
person_ftable = df_final.person.value_counts().to_frame("person")
plt.hist(person_ftable.person, density=False)
plt.xlabel('Times a Person is Repeated')
plt.ylabel('Unique persons')
plt.title('Frequency Histogram of Repeated Persons')
plt.show()
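
Exact bar heights can be hard to read off a histogram, so the same information can also be tabulated. A minimal sketch on toy data (the single-letter names are placeholders, not real entries):

```python
import pandas as pd

# Toy "person" column (placeholder names)
person = pd.Series(["A", "B", "A", "C", "A", "B", "D"])

# How many monuments each person has
per_person = person.value_counts()

# How many people have exactly k monuments, for each k
freq_of_freq = per_person.value_counts().sort_index()
print(freq_of_freq)
```

Here the index is the number of monuments and the value is the number of people honored that many times.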

One person is honored 6 times across various monuments. Let's see who it is.

In [None]:
person_ftable.loc[person_ftable["person"]==6].index[0]

George Moscone was the mayor of San Francisco from 1976 until 1978, when he was assassinated. It makes sense that he would be honored more than the average individual.

Let's look at the distribution of genders in monuments. Note that a subset of statues includes more than one person, specifically those with two male subjects or with one male and one female subject.

In [None]:
# Merge "M & F" into "F & M" before counting, and name the count column explicitly
# (its default name depends on the pandas version)
gender_ftable = df_final.gender.replace({"M & F":"F & M"}).value_counts().to_frame("gender")
x_axis_order = ["M & M", "M", "F & M", "F"]
plt.xlabel('Gender Combinations')
plt.ylabel('Monuments/Statues')
plt.title('Distribution of Gender in SF Monuments')
plt.text("F & M",40,"M only = 54",fontsize=10)
plt.text("F & M",45,"F only = 19")
plt.text("F & M",35,"Both = 8")
plt.bar(gender_ftable.loc[x_axis_order].index, 
        gender_ftable.gender.loc[x_axis_order], 
        color = ['red', 'red', 'purple', 'blue'])

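Here we can see that there are far more statues of men than of women: going by the counts annotated on the plot, male-only monuments outnumber female-only ones by nearly three to one (54 versus 19).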
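
To put the gap in relative terms, the counts annotated on the plot can be turned into shares. A minimal sketch; the three counts are copied from the plot annotations rather than recomputed from the data:

```python
# Counts as annotated on the plot above
counts = {"M only": 54, "F only": 19, "Both": 8}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Ratio of male-only to female-only monuments
ratio = counts["M only"] / counts["F only"]
print(f"M only : F only = {ratio:.2f} : 1")
```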

Let's move on to the reference column.

In [None]:
col_chr.loc["Reference"]

The reference column contains free-text strings, some of which are repeated. As such, we can do a similar analysis as with the person column and check how many times each reference is repeated.

In [None]:
# Name the count column explicitly; its default name depends on the pandas version
ref_ftable = df_final.ref.value_counts().to_frame("ref")
plt.hist(ref_ftable.ref, density=False)
plt.xlabel('Times a Reference is Repeated')
plt.ylabel('Unique References')
plt.title('Frequency Histogram of Repeated References')
plt.show()

So, we can see a right-skewed histogram which is more spread out than the person column. There seem to be 3 references repeated 5 times or more. Let's take a look at each:

In [None]:
ref_ftable[ref_ftable["ref"]>=5]

Now let's conduct a gender breakdown of the reference column. There are a lot of references, so let's include only those associated with more than 1 statue.

In [None]:
ref_gender2=df_final.groupby(["ref","gender"]).size() \
                    .to_frame().reset_index() \
                    .rename(columns={0:"Count","ref":"Reference"})
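# Total monuments per reference, summed across gender categories
ref_freq=ref_gender2.groupby("Reference")["Count"].sum()
# Keep references tied to at least 2 monuments; isin() compares exact strings,
# so punctuation in a reference is not read as a regular expression
ref_considered=ref_freq[ref_freq>=2].index
ref_gender_final=ref_gender2[ref_gender2["Reference"].isin(ref_considered)]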
sns.barplot(y="Reference",x="Count",hue="gender",
            data=ref_gender_final,
            hue_order=["M","F & M","F"],
            palette=["Red","grey","blue"],orient="h")
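
When a free-text column is filtered down to a set of values, exact matching and pattern matching can give different results, because `str.contains` treats its argument as a regular expression by default. A minimal sketch on toy strings (invented for illustration) showing the difference:

```python
import re
import pandas as pd

# Toy reference strings (invented); note the square brackets in one of them
refs = pd.Series(["Plaza East", "Plaza [East]", "Parks list", "Parks list"])
wanted = ["Plaza [East]", "Parks list"]

# Exact matching: keeps only rows equal to one of the wanted strings
exact = refs[refs.isin(wanted)]

# Pattern matching on the raw strings: "[East]" is read as a character class,
# so "Plaza East" matches while the literal "Plaza [East]" does not
raw_pattern = "|".join(wanted)
by_raw_pattern = refs[refs.str.contains(raw_pattern)]

# Escaping each string first makes the pattern match the literal text
escaped_pattern = "|".join(re.escape(w) for w in wanted)
by_escaped = refs[refs.str.contains(escaped_pattern)]

print(exact.index.tolist())
print(by_raw_pattern.index.tolist())
print(by_escaped.index.tolist())
```

Even with escaping, `str.contains` matches substrings, so `isin` is the closer fit when whole values are meant.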

Now we can move on to the comments column:

In [None]:
col_chr.loc["Comments"]

Again, we will plot a distribution of how often various comments are repeated.

In [None]:
# Name the count column explicitly; its default name depends on the pandas version
com_ftable = df_final.comments.value_counts().to_frame("comments")
plt.hist(com_ftable.comments, density=False)
plt.xlabel('Times a Comment is Repeated')
plt.ylabel('Unique Comments')
plt.title('Frequency Histogram of Repeated Comments')
plt.show()

Here there are 3 comments repeated 3 or more times. Let's look at what they were.

In [None]:
com_ftable[com_ftable["comments"]>=3]

It seems that this information might denote the environment of a monument.

Let's do a gender breakdown of the comments column, again only considering comments that occur more than once.


In [None]:
com_gender2=df_final.groupby(["comments","gender"]).size() \
                    .to_frame().reset_index() \
                    .rename(columns={0:"Count","comments":"Comment"})
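# Total monuments per comment, summed across gender categories
com_freq=com_gender2.groupby("Comment")["Count"].sum()
# Keep comments that occur at least twice; isin() compares exact strings,
# so punctuation in a comment is not read as a regular expression
com_considered=com_freq[com_freq>=2].index
com_gender_final=com_gender2[com_gender2["Comment"].isin(com_considered)]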
sns.barplot(y="Comment",x="Count",hue="gender",
            data=com_gender_final,
            hue_order=["M","F & M","F"],
            palette=["Red","grey","blue"],orient="h")

That concludes a column-by-column examination of the dataset. There are many possibilities for merging it with other datasets, such as locational data.
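
As a rough idea of what such a merge could look like, here is a minimal sketch. Both tables are toy data, and the locational table with its `latitude` and `longitude` columns is entirely hypothetical; no such table is part of this dataset:

```python
import pandas as pd

# Toy monuments table (names invented for illustration)
monuments = pd.DataFrame({
    "name": ["Plaza A", "Park B", "Library C"],
    "gender": ["M", "F", "M"],
})

# Hypothetical locational table (made-up coordinates)
locations = pd.DataFrame({
    "name": ["Plaza A", "Park B"],
    "latitude": [37.77, 37.76],
    "longitude": [-122.42, -122.44],
})

# A left join keeps every monument; indicator=True adds a "_merge" column
# recording whether each row found a match in the locational table
merged = monuments.merge(locations, on="name", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "name"].tolist()

print(merged)
print(unmatched)
```

Listing the unmatched names first is a cheap way to spot spelling differences between the two sources before doing any spatial analysis.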

Thanks to Ashwani Rathee for submitting the data to Kaggle.

Original Data Source: “Representation of Women in City Property - City Administrator's List.” Data.gov, Publisher Data.sfgov.org, 7 Oct. 2020, catalog.data.gov/dataset/representation-of-women-in-city-property-city-administrators-list. 