# Data Analysis

In [None]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import dexplot as dxp
import altair as alt
# !pip install dexplot
# !pip install altair

In [None]:
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)
pd.options.mode.chained_assignment = (
    None  # default='warn', this removes warning on dropping columns
)

warnings.filterwarnings("ignore", category=DeprecationWarning) # Silence deprecation warnings raised by the libraries used here


## Data Import

In [None]:
# Read in the dataset prepared in Create_Target notebook
df = pd.read_csv("../Data/processed/dataset_2016-19_target")

## Data Overview
The first thing we want to do is familiarize ourselves with the data. We start by looking at the first few rows of our table, and then dive into some of the characteristics of the features, like their data types and unique value counts.

In [None]:
df.replace(["democratic republic of the congo"], ["DRC"], inplace=True) # This makes many visuals easier to read
df.head()

In [None]:
df.info(verbose=True, show_counts=True)

In [None]:
df.nunique()

Numerical and categorical data call for different approaches. Before getting into the weeds with either type, we want to split our dataframe up, starting with numerical data.

In [None]:
num_cols = ['t_resettlement','numppl','qn1d','qn2a','qn18b','qn26b','qn29c_months','qn30d','qn31d','qn32d','qn33d','qn34d'\
            ,'qn38b','ui_qn8a_annual']
num_cols_rename = ['t_resettlement','numppl','Age','Pre_Yrs_School','Avg_Hr_Worked','Mth_At_Residence',\
                   'Refugee_Medical_Mths','FS_Mths','TANF_Mths','RCA_Mths','SSI_Mths','GA_Mths','Housing','Salary_EST']
num_df = df[num_cols]
num_df.columns = num_cols_rename

# num_df = num_df[num_df['Salary_EST']<500000]
num_df.describe()

From personal knowledge of what these features would look like in the real world, the averages and ranges all look relatively normal. The only exceptions might be the 16,000 monthly housing cost and the 977,600 salary. These values are certainly possible, but they could also be data entry or response mistakes. Without being able to confirm or disqualify these numbers, we will want to use them in the model building, but for the sake of visualization, we might want to drop them.
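One way to drop extreme values for plotting only, while leaving the modeling data untouched, is to filter on an upper quantile. The snippet below is a minimal sketch on made-up numbers; the helper name `trim_for_plotting` and the cutoff are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def trim_for_plotting(frame: pd.DataFrame, column: str, upper_q: float = 0.99) -> pd.DataFrame:
    """Return a filtered copy keeping rows at or below the `upper_q` quantile of `column`."""
    cutoff = frame[column].quantile(upper_q)
    # Rows with a missing value in `column` are also dropped by this comparison
    return frame[frame[column] <= cutoff].copy()


# Toy data standing in for num_df; the values are invented for illustration
toy = pd.DataFrame({"Salary_EST": [20000, 25000, 28000, 31000, 977600]})
trimmed = trim_for_plotting(toy, "Salary_EST", upper_q=0.8)

print(len(toy), len(trimmed))
```

Because the helper returns a copy, the original frame keeps every row and can still be passed to the models as is.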

Next, let's take a look at the categorical features. Given that some of these values are taken directly from survey questions, and others are constructed from those answers, we are going to separate them so that we can discuss them apart from one another.

In [None]:
cat_cols = ['qn1c','qn1f','qn1k','qn2b','qn3a','qn4a','qn4b','qn11a','qn13','qn19b','qn20','qn25a','qn26estate','qn26f'\
            ,'qn26h','qn29b','qn38a']

const_cols= ['ui_soi_pubassist','ui_soi','ui_cashassist','ui_lfp'
            ,'ui_emprate','ui_medicaidrma','ui_lpr','ui_school','ui_work']

cat_cols_rename = ['Marital Satus','Gender','Original Region','Pre Degree','Pre Civil Status','Pre Eng Exp','Post Eng Exp'\
                   ,'Work Since Resettle','Job Search Past Month','Job Industry','Job Sector','Attended School'\
                   ,'Curr State','Relocation Reasion','Child Edu Participation','Med Care Source','House Ownship']

const_cols_rename = ['Public Assist','Source of Income','Cash Assit','Labor Force Participation'
            ,'Employment','Public Healthcare','Legal Perment Resident','Educational Pursuit','Work Status']

cat_df = df[cat_cols]
cat_df.columns = cat_cols_rename

const_df = df[const_cols]
const_df.columns = const_cols_rename

## Feature analysis
In this next section we take a look at a few features in more detail. We start with a close-up on the balance of the success target: overall, by arrival year, and by survey year. We also want to showcase which countries and ethnicities are included in this dataset.

In [None]:
# Overall, the survey has decent representation of both successful and unsuccessful resettlements
df['t_resettlement'].value_counts()

### Arrival and Survey Year Trends

In [None]:
# Success and count by arrival year
base = alt.Chart(df).encode(x=alt.X('qn1jyear:N',title='Arrival Year'),)

bar = base.mark_bar().encode(y=alt.Y('count(t_resettlement):Q', title='Count of Respondents'))
line =  base.mark_line(
    color='green',
    point={
      "filled": False,
      "fill": "white",
      "color": 'green' 
    }
).encode(
    y=alt.Y('mean(t_resettlement):Q', title='Resettlement Success Rate', axis=alt.Axis(format='%'))
)


(bar + line).resolve_scale(y='independent').properties(width=600, title="Count and success rate of resettlement by arrival year")

By year of arrival, the counts follow a roughly bell-shaped distribution, with most respondents arriving between 2014 and 2018. Because this survey covers refugees who arrived within the past five years, and the most recent survey was conducted in 2019, this distribution makes sense. The resettlement success rate has also remained flat across arrival years. Ideally we would like to see this trend have a positive slope, but we can feel reassured that it at least isn't decreasing.
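The numbers behind a chart like this can also be read off a table. The sketch below uses a made-up frame and assumes `t_resettlement` is coded 0/1, which is what taking its mean as a success rate implies; the column names are reused from the dataset, the values are not.

```python
import pandas as pd

# Toy data standing in for df; values are invented for illustration
toy = pd.DataFrame({
    "qn1jyear": [2014, 2014, 2015, 2015, 2016, 2016, 2016, 2016],
    "t_resettlement": [1, 0, 1, 0, 1, 1, 0, 0],
})

# One row per arrival year: how many respondents, and what share were successful
summary = (
    toy.groupby("qn1jyear")["t_resettlement"]
    .agg(respondents="count", success_rate="mean")
)

print(summary)
```

A table like this makes it easier to judge whether "flat" holds within a few percentage points, or whether small cohorts are driving any apparent change.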

In [None]:
# Success and count by survey year
base = alt.Chart(df).encode(x=alt.X('survey_year:N',title='Survey Year'),)

bar = base.mark_bar().encode(y=alt.Y('count(t_resettlement):Q', title='Count of Respondents'))
line =  base.mark_line(
    color='green',
    point={
      "filled": False,
      "fill": "white",
      "color": 'green' 
    }
).encode(
    y=alt.Y('mean(t_resettlement):Q', title='Resettlement Success Rate', axis=alt.Axis(format='%'))
)


(bar + line).resolve_scale(y='independent').properties(width=600, title="Count and success rate of resettlement by survey year")

As expected, the respondent representation by survey year is flat, and the success rate has remained stable year over year.

### Country and Ethnicity Makeup

In [None]:
# Count of country representation (breakdown by year)
alt.Chart(df).mark_bar().encode(
    x=alt.X('count(*):Q', title='Count of Respondents'),
    y=alt.Y('qn1h:N', title='Country'),
    color='survey_year:N'
).properties(
    width=700,
    height=500,
    title='Respondents by country and survey year'
)

There is a wide array of countries represented in our dataset, but the representation is nowhere near even, with Iraq having by far the most respondents. Many of the countries have had consistent representation from year to year, but for a few countries, like El Salvador and Ukraine, the number of respondents has been increasing in recent years.

In [None]:
# Count of ethnicity representation (breakdown by year)
alt.Chart(df).mark_bar().encode(
    x=alt.X('count(*):Q', title='Count of Respondents'),
    y=alt.Y('qn1i:N', title='Ethnicity'),
    color='survey_year:N'
).properties(
    width=700,
    height=500,
    title='Respondents by ethnicity and survey year'
)

A few ethnicities appear in only one or two survey years, so we were curious whether they were simply missed in previous surveys or whether there has been a recent influx of those groups. To check this, we filter our dataframe down to just the ethnicities of interest and recreate the chart with year of arrival.


In [None]:
mono_df = df[df['qn1i'].isin(['fars','kurd','rohingya','siryac','tigrinya','tutsi','ukrainian'])]
alt.Chart(mono_df).mark_bar().encode(
    x=alt.X('count(*):Q', title='Count of Respondents'),
    y=alt.Y('qn1i:N', title='Ethnicity'),
    color='qn1jyear:N'
).properties(
    width=700,
    height=500,
    title='Respondents by ethnicity and arrival year'
)

It does appear that many of these ethnicities were already present in the US before the survey year in which they first show up. For many of them, there was an influx in the year or two immediately before that survey year. For example, about half of the Fars respondents arrived in 2016, while all of them appear in the 2017 survey. Similar cases can be found with the Ukrainian and Tigrinya populations.
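The gap between arrival and survey can also be tabulated directly. The following is a rough sketch on invented rows; it assumes `qn1jyear` and `survey_year` are both stored as integers, which may not hold for the real file, and the column `years_before_survey` is a name introduced here for illustration.

```python
import pandas as pd

# Toy rows standing in for mono_df; values are invented for illustration
toy = pd.DataFrame({
    "qn1i": ["fars", "fars", "fars", "fars", "ukrainian", "ukrainian"],
    "qn1jyear": [2015, 2016, 2016, 2017, 2017, 2018],
    "survey_year": [2017, 2017, 2017, 2017, 2018, 2019],
})

# Years between arrival and the survey in which the respondent appears
lag = toy.assign(years_before_survey=toy["survey_year"] - toy["qn1jyear"])

# Rows: ethnicity, columns: lag in years, cells: respondent counts
table = pd.crosstab(lag["qn1i"], lag["years_before_survey"])

print(table)
```

Counts concentrated in the 0 to 2 year columns would be consistent with a recent influx rather than a group that earlier surveys overlooked.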

### Column Type overview
In this next section we take a very high-level view of all of the features, starting with the numerical ones. This section is more exploratory, and we just want to see if there are any trends worth digging into further.

#### Numerical Columns Visualization

In [None]:
fig, axes = plt.subplots(7, 2, figsize=(12, 40))
axs = axes.ravel()

# The two discrete columns are drawn as count plots
sns.countplot(ax=axes[0, 0], x=num_df['t_resettlement'], color='#1f77b4')
sns.countplot(ax=axes[0, 1], x=num_df['numppl'], color='#1f77b4')

for ax in axs:  # distinct loop name so the `axes` grid is not overwritten
    ax.set(ylabel='Count')

# The remaining columns are drawn as histograms, one per subplot
for ax, col in zip(axs, num_cols_rename):
    ax.set_title("Count of " + col)
    if col in ['t_resettlement', 'numppl']:
        continue  # already plotted above

    x = num_df[col]
    filtered = x[x.between(0, x.quantile(.99))].astype(int)  # Removing outliers for graph readability
    ax.hist(filtered)

    if 'Mths' in col:
        ax.set(xlabel='Months')
    elif col in ['Housing', 'Salary_EST']:
        ax.set(xlabel='Dollars')
    else:
        ax.set(xlabel=col)

Nothing in these charts really jumps out as worth digging into, with most features clustered around what we might consider a normal value, such as housing at 1,000, salary at 28,000, or years of school around 12. Looking at how many months a respondent has used one form of government assistance or another, there are significant numbers only at either end of the 0-12 month range.

#### Categorical Columns Visualization

In [None]:
fig, axes = plt.subplots(17, 1, figsize=(12, 65))

# Shorten the longest label; assign the result back instead of calling an in-place replace on a column selection
cat_df['House Ownship'] = cat_df['House Ownship'].replace(
    "owned by you or someone in this household with or without a mortgage or loan",
    "owned by someone in this household",
)
for ax, col in zip(axes, cat_cols_rename):
    cat_df[col].value_counts().plot.barh(ax=ax)

For categorical features, a few callouts come to mind. The gender representation in the dataset skews rather heavily toward males, and if it turns out that this feature has an impact on the outcome of our models, it might be worth looking into further. While the dataset unfortunately doesn't go as granular as the state where the respondent resides, it does have the region, and there is a fairly even distribution between the regions, aside from the Northeast, which has considerably less representation.
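A quick way to check whether a categorical feature such as gender is associated with the success target is a chi-square test of independence. This is a sketch on invented counts, not a result from the survey; it uses `scipy.stats.chi2_contingency` on a contingency table built with `pd.crosstab`.

```python
import pandas as pd
from scipy import stats

# Toy data: 60 male and 40 female respondents with identical success rates (invented)
toy = pd.DataFrame({
    "Gender": ["male"] * 60 + ["female"] * 40,
    "t_resettlement": [1] * 30 + [0] * 30 + [1] * 20 + [0] * 20,
})

# Rows: gender, columns: success flag, cells: respondent counts
contingency = pd.crosstab(toy["Gender"], toy["t_resettlement"])

chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print(contingency)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A large p-value, as in this toy case, would suggest that the skew in representation does not by itself translate into a difference in outcomes. The same pattern applies to any column in `cat_df`.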

One interesting piece of information this survey collects is how well the respondent spoke English prior to entering the United States, and at the time of the survey. Understanding whether there is a correlation between an individual improving in this metric and their success status is certainly worth looking into.
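One possible way to look at this is to map the two English answers onto an ordered scale and compare success rates for respondents who improved against those who did not. In the sketch below, the answer labels and their ordering in `levels` are hypothetical, since the actual response wording is not shown here, and the rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical ordering of the answer labels; the real survey wording may differ
levels = {"not at all": 0, "not well": 1, "well": 2, "very well": 3}

# Toy rows standing in for cat_df joined with the target (invented values)
toy = pd.DataFrame({
    "Pre Eng Exp": ["not at all", "not well", "well", "not at all"],
    "Post Eng Exp": ["well", "not well", "very well", "not at all"],
    "t_resettlement": [1, 0, 1, 0],
})

# Positive change means the respondent reports better English at survey time
change = toy["Post Eng Exp"].map(levels) - toy["Pre Eng Exp"].map(levels)
toy["english_change"] = np.where(change > 0, "improved", "not improved")

rate_by_change = toy.groupby("english_change")["t_resettlement"].mean()

print(rate_by_change)
```

Answers that are missing from the mapping become NaN and fall into the "not improved" group here, so a real analysis would want to handle "don't know" or refused responses explicitly.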

Finally, there is a drastic difference in the number of respondents renting versus owning their home, with nearly 6x as many renting. This is the opposite of the national picture, where only about 36% of households rent their home (https://www.pewresearch.org/short-reads/2021/08/02/as-national-eviction-ban-expires-a-look-at-who-rents-and-who-owns-in-the-u-s/). This is another feature whose impact might be worth looking at.

#### Constructed Columns Visualization

In [None]:
fig, axes = plt.subplots(9, 1, figsize=(12, 40))

# One horizontal bar chart per constructed feature
for ax, col in zip(axes, const_cols_rename):
    const_df[col].value_counts().plot.barh(ax=ax)

Constructed variables come from combinations of questions that were actually asked during the survey. For example, here is the definition of the Employment (ui_emprate) feature from the user guide:

*ui_emprate: This variable reports individuals’ employment status: employed, unemployed, not in the labor force, or doesn’t know or refused to respond. It was created using responses to qn5a and qn13. Individuals are considered “employed” if they report working at a job anytime the week before survey administration (qn5a), “unemployed” if they report not working at a job anytime the week before survey administration (qn5a) and looking for work during the four weeks before survey administration (qn13), and “not in the labor force” if they report not working at a job anytime the week before survey administration (qn5a) and either report not looking for work during the four weeks before survey administration, don’t know, or refuse to respond (qn13). Respondents who either don’t know or refuse to respond to qn5a are marked “Don’t know and/or refused” for ui_emprate.*

We aren't going to go too deep into analysis of these features, as we view them more as the outcome of the survey and as what we based our success metric on, but there are a few quick takeaways worth mentioning. As reported, most of the respondents are active in the workforce and are currently working. Given their role in the workforce, it isn't a surprise to see that very few are in school. Lastly, while most respondents receive public assistance, this assistance isn't in the form of cash assistance, which includes Temporary Assistance for Needy Families (TANF), Refugee Cash Assistance (RCA), Supplemental Security Income (SSI), and General Assistance (GA).

### Questions with Multiple Answers
These next features represent questions that allowed for multiple answers. In the data preparation phase, we combined the several columns that each represented one possible answer to a question into a single column holding a list of the answers. Encoding each of these answers into its own column makes the dataset far too wide considering the sample size. This width leads our models to overfit, so instead of including them in the model, we address them here with some visualizations and analysis.
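A list-valued column like this can be analyzed by expanding it so that each answer gets its own row. The sketch below is an illustration on invented data: the column name `answers` is hypothetical, and it assumes the lists were written to CSV as Python-style strings, which is why they are parsed with `ast.literal_eval` first.

```python
import ast

import pandas as pd

# Toy data: list answers as they might look after a CSV round trip (invented)
toy = pd.DataFrame({
    "t_resettlement": [1, 0, 1],
    "answers": ["['family', 'jobs']", "['jobs']", "['family', 'climate']"],
})

# Turn the string representation back into real Python lists
toy["answers"] = toy["answers"].apply(ast.literal_eval)

# One row per (respondent, answer) pair
long = toy.explode("answers")

counts = long["answers"].value_counts()
success_by_answer = long.groupby("answers")["t_resettlement"].mean()

print(counts)
print(success_by_answer)
```

Note that a respondent who gave several answers contributes to several groups, so the per-answer success rates are not independent of one another.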

Now that we have familiarized ourselves with the dataset and called out some specific insights, we can move on to the machine learning portion of this project, where we aim to understand how we can use this data to improve the resettlement success rate.