## Instructions {-}

1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity. 

2. Do not write your name on the assignment.

3. Write your code in the *Code* cells and your answer in the *Markdown* cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

4. Use [Quarto](https://quarto.org/docs/output-formats/html-basics.html) to print the *.ipynb* file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: `quarto render filename.ipynb --to html`. Submit the HTML file.

5. The assignment is worth 100 points, and is due on **22nd October 2023 at 11:59 pm**. There is an optional bonus question worth 10 points. You can score a maximum of 110 (out of 100) points.

6. You are **not allowed to use a `for` loop** or any other kind of loop in this assignment.

## C.1 GDP per capita & social indicators
Read the file *social_indicator.txt* with Python. Set the first column as the index when reading the file. How many observations and variables are there in the data?

*(4 points)*

In [214]:
import pandas as pd

social_data = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-1/Data/social_indicator.txt", index_col="Index",sep="\t")

print(len(social_data))


155
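
The count above covers only the observations; the `shape` attribute reports observations and variables together. A minimal sketch, assuming a made-up three-row, tab-separated table (the text in `toy_text` is hypothetical and only borrows the column names `Index`, `country`, and `gdpPerCapita`):

```python
import io
import pandas as pd

# Hypothetical stand-in for social_indicator.txt (tab-separated).
toy_text = "Index\tcountry\tgdpPerCapita\n1\tA\t100\n2\tB\t200\n3\tC\t300\n"

toy = pd.read_csv(io.StringIO(toy_text), sep="\t", index_col="Index")

# shape is (number of observations, number of variables);
# the column used as the index is not counted among the variables.
n_obs, n_vars = toy.shape
print(n_obs, n_vars)
```
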


### C.1.1
Which variables have the strongest and weakest correlations with GDP per capita? Note that `lifeFemale` and `lifeMale` are the female and male life expectancies respectively.

Note that only the magnitude of the correlation is considered when judging a correlation as strong or weak.

*(4 points)*

In [215]:
# find correlations of each variable with gdpPerCapita
print(social_data.corr()['gdpPerCapita'].sort_values(ascending=False))

print("lifeFemale and lifeMale have the strongest correlations with gdpPerCapita and contraception and economicActivityFemale have the weakest correlations with gdpPerCapita")


gdpPerCapita              1.000000
lifeFemale                0.604029
lifeMale                  0.592267
economicActivityFemale    0.052964
contraception             0.004286
economicActivityMale     -0.167231
illiteracyFemale         -0.457012
illiteracyMale           -0.471689
totalfertilityrate       -0.548172
infantMortality          -0.584060
Name: gdpPerCapita, dtype: float64
lifeFemale and lifeMale have the strongest correlations with gdpPerCapita and contraception and economicActivityFemale have the weakest correlations with gdpPerCapita
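
Because only the magnitude matters, one way to read off the strongest and weakest variables is to take absolute values before sorting. The sketch below uses made-up numbers (the `toy` frame is hypothetical; only its column names echo the assignment) and assumes a pandas version in which `DataFrame.corr` accepts `numeric_only`:

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "gdpPerCapita": [1, 2, 3, 4, 5],
    "lifeFemale": [60, 65, 70, 72, 80],
    "infantMortality": [90, 70, 40, 30, 10],
    "contraception": [30, 10, 40, 10, 30],
})

# Correlation of every numeric column with gdpPerCapita, ordered by magnitude.
corr_with_gdp = (
    toy.corr(numeric_only=True)["gdpPerCapita"]
    .drop("gdpPerCapita")          # a variable's correlation with itself is not informative
    .abs()
    .sort_values(ascending=False)
)

strongest = corr_with_gdp.index[0]
weakest = corr_with_gdp.index[-1]
print(strongest, weakest)
```

In this toy data the strongest relationship is a negative one, which a sort on the signed values would place last.
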


### C.1.2
Does the male economic activity *(in the column `economicActivityMale`)* have a positive or negative correlation with GDP per capita? Did you expect the positive/negative correlation? If not, why do you think you are observing that correlation?

*(4 points)*

Male economic activity has a negative correlation with GDP per capita. I didn't expect this negative correlation but I believe I'm observing this correlation because nations with higher male economic activity relative to average are more unequal and thus less prosperous societies.

### C.1.3

What is the rank of the US amongst all countries in terms of GDP per capita? Which countries lie immediately above, and immediately below the US in the ranking in terms of GDP per capita? The country having the highest GDP per capita ranks 1.

Note that:

1. The US is mentioned as *United States* in the data. 

2. The country with the highest GDP per capita will have rank 1, the country with the second highest GDP per capita will have rank 2, and so on.

**Hint:** [rank()](https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html)

*(4 points)*

In [216]:
social_data['rank'] = social_data['gdpPerCapita'].rank(ascending=False)
print(social_data.loc[social_data['country'] == 'United States', 'rank'].values[0])
print(social_data.loc[social_data["rank"] == 7.0, "country"].values[0])
print(social_data.loc[social_data["rank"] == 9.0, "country"].values[0])

8.0
Norway
Brunei
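
The neighbouring ranks can also be derived from the US rank instead of being typed in by hand. A minimal sketch on invented data, assuming there are no ties around the US (with ties, `rank()` returns averaged ranks and the neighbours would need a different lookup); `gdp_rank` and `neighbours` are illustrative names:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "country": ["A", "B", "United States", "C", "D"],
    "gdpPerCapita": [50, 40, 30, 20, 10],
})

# Highest GDP per capita gets rank 1.
toy["gdp_rank"] = toy["gdpPerCapita"].rank(ascending=False)

us_rank = toy.loc[toy["country"] == "United States", "gdp_rank"].iloc[0]

# Countries ranked immediately above and immediately below the US.
neighbours = (
    toy.loc[toy["gdp_rank"].isin([us_rank - 1, us_rank + 1])]
    .sort_values("gdp_rank")["country"]
    .tolist()
)
print(us_rank, neighbours)
```
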


### C.1.4

Which country or countries rank among the top 20 in terms of each of these social indicators - `economicActivityFemale`, `economicActivityMale`, `gdpPerCapita`, `lifeFemale`, `lifeMale`? For each of these social indicators, the country having the largest value ranks 1 for that indicator.

*(6 points)*

**Hint:** 

1. Use [rank()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html). Note that this method is different from the method given in the hint of the previous question. This method is of the DataFrame class, while the one in the previous question is of the Series class. *This part of the hint is just for your understanding. You don't need to write any code in this part.*

2. Using `rank()`, get the DataFrame consisting of the ranks of countries on each of the relevant social indicators *(one line of code)*.

3. In the DataFrame obtained in (2), filter the rows for which the maximum rank is less than or equal to 20 *(one line of code)*.

In [217]:
ranked_indicators = social_data[['economicActivityFemale', 'economicActivityMale', 'gdpPerCapita', 'lifeFemale', 'lifeMale']].rank(ascending=False)
top_countries = ranked_indicators[ranked_indicators.max(axis=1) <= 20]
print(social_data.loc[top_countries.index, ['country']].values[0])

['Iceland']
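
The two hinted steps can be sketched on a small invented table. Here `top_k` plays the role of 20, and the full list of matching countries is kept rather than only the first match; the data and the name `in_top_k_everywhere` are hypothetical:

```python
import pandas as pd

# Hypothetical toy data with the country as the index.
toy = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D"],
        "lifeFemale": [80, 79, 70, 60],
        "lifeMale": [75, 76, 65, 72],
    }
).set_index("country")

# DataFrame.rank ranks every column independently; largest value gets rank 1.
ranks = toy.rank(ascending=False)

# A country is in the top k on every indicator when its worst (largest) rank is <= k.
top_k = 2
in_top_k_everywhere = ranks.index[ranks.max(axis=1) <= top_k].tolist()
print(in_top_k_everywhere)
```
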


### C.1.5

On which social indicator among `economicActivityFemale`, `economicActivityMale`, `gdpPerCapita`, `lifeFemale`, `lifeMale`, `illiteracyFemale`, `illiteracyMale`, `infantMortality`, and `totalfertilityrate` does the US have its worst ranking, and what is the rank? Note that for `illiteracyFemale`, `illiteracyMale`, and `infantMortality`, the country having the lowest value will rank 1, in contrast to the other social indicators.

*(8 points)*

In [218]:
descending_ranked_columns = social_data[['illiteracyFemale', 'illiteracyMale', 'infantMortality']].rank(ascending=False)
ascending_ranked_columns = social_data[["totalfertilityrate"]].rank(ascending=True)
all_indicators = pd.concat([ranked_indicators, descending_ranked_columns, ascending_ranked_columns], axis=1)

us_index = social_data.loc[social_data['country'] == 'United States'].index
us_rankings = all_indicators.loc[us_index, :]
us_rankings[["illiteracyFemale", "illiteracyMale", "infantMortality"]] = all_indicators[["illiteracyFemale", "illiteracyMale", "infantMortality"]].max().max() - us_rankings[["illiteracyFemale", "illiteracyMale", "infantMortality"]]


print(us_rankings.idxmax(axis=1).values[0])

economicActivityMale
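
The question asks for both the indicator and the rank. One possible sketch ranks the "higher is better" and "lower is better" columns in opposite directions and then reads off the worst rank directly. The data are invented, and which columns belong in which list is an assumption to be checked against the question:

```python
import pandas as pd

# Hypothetical toy data with the country as the index.
toy = pd.DataFrame(
    {
        "gdpPerCapita": [50, 40, 30],
        "lifeFemale": [70, 80, 75],
        "infantMortality": [5, 20, 10],
    },
    index=["United States", "B", "C"],
)

higher_is_better = ["gdpPerCapita", "lifeFemale"]   # largest value ranks 1
lower_is_better = ["infantMortality"]               # smallest value ranks 1

ranks = pd.concat(
    [
        toy[higher_is_better].rank(ascending=False),
        toy[lower_is_better].rank(ascending=True),
    ],
    axis=1,
)

us_ranks = ranks.loc["United States"]
worst_indicator = us_ranks.idxmax()   # indicator with the largest (worst) rank
worst_rank = us_ranks.max()
print(worst_indicator, worst_rank)
```
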


### C.1.6

Find all the countries that have a lower GDP per capita than the US, despite having lower illiteracy rates (for both genders), higher economic activity (for both genders), higher life expectancy (for both genders), and a lower infant mortality rate than the US.

*(6 points)*

In [219]:
us_gdp = social_data.loc[social_data['country'] == 'United States', 'gdpPerCapita'].values[0]
us_male_illiteracy = social_data.loc[social_data['country'] == 'United States', 'illiteracyMale'].values[0]
us_female_illiteracy = social_data.loc[social_data['country'] == 'United States', 'illiteracyFemale'].values[0]
us_male_economic_activity = social_data.loc[social_data['country'] == 'United States', 'economicActivityMale'].values[0]
us_female_economic_activity = social_data.loc[social_data['country'] == 'United States', 'economicActivityFemale'].values[0]
us_life_fem = social_data.loc[social_data['country'] == 'United States', 'lifeFemale'].values[0]
us_life_male = social_data.loc[social_data['country'] == 'United States', 'lifeMale'].values[0]
us_infant_mortality = social_data.loc[social_data['country'] == 'United States', 'infantMortality'].values[0]


conditions = (
    (social_data['gdpPerCapita'] < us_gdp) & 
    (social_data['illiteracyFemale'] < us_female_illiteracy) & 
    (social_data['illiteracyMale'] < us_male_illiteracy) & 
    (social_data['economicActivityFemale'] > us_female_economic_activity) & 
    (social_data['economicActivityMale'] > us_male_economic_activity) & 
    (social_data['lifeFemale'] > us_life_fem) & 
    (social_data['lifeMale'] > us_life_male) & 
    (social_data['infantMortality'] < us_infant_mortality)
)

print(social_data.loc[conditions, "country"].values)

['Iceland' 'Sweden' 'Netherlands']
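
The eight separate lookups can be condensed by comparing whole groups of columns against the US row at once. This is a sketch on invented data, not a replacement for the cell above; `higher_than_us` and `lower_than_us` are hypothetical names, and the column grouping is an assumption:

```python
import pandas as pd

# Hypothetical toy data with the country as the index.
toy = pd.DataFrame(
    {
        "gdpPerCapita": [50, 40, 30, 60],
        "lifeFemale": [78, 80, 82, 85],
        "infantMortality": [7, 5, 9, 4],
    },
    index=["United States", "B", "C", "D"],
)

us = toy.loc["United States"]          # Series indexed by column label

higher_than_us = ["lifeFemale"]
lower_than_us = ["gdpPerCapita", "infantMortality"]

# gt/lt align the Series index with the DataFrame columns and compare row by row.
mask = (
    toy[higher_than_us].gt(us[higher_than_us]).all(axis=1)
    & toy[lower_than_us].lt(us[lower_than_us]).all(axis=1)
)
matches = toy.index[mask].tolist()
print(matches)
```
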


## C.2 GDP per capita vs social indicators
We'll use the same data as in the previous question. For the questions below, assume that all numeric columns, **except GDP per capita**, are social indicators.

### C.2.1
Use the column `geographic_location` to create a new column called `continent`. Merge the values of the `geographic_location` column appropriately to obtain 6 distinct values for the `continent` column – *Asia, Africa, North America, South America, Europe* and *Oceania*. Drop the column `geographic_location`. Print the first 5 observations of the updated DataFrame.

*(8 points)*

**Hint:**

1. Use `value_counts()` to see the values of `geographic_location`. The code `if 'Asia' in 'something'` will return `True` if '*something*' contains the string '*Asia*', for example, if '*something*' is '*North Asia*', the code will return `True`. *This part of the hint is just for your understanding. You don't need to write any code in this part.*

2. Apply a lambda function on the Series `geographic_location` to replace a string that contains '*Asia*' with '*Asia*', replace a string that contains '*Europe*' with '*Europe*', and replace a string that contains '*Africa*' with '*Africa*'. *This will be a single line of code.*

3. Rename the column `geographic_location` to `continent`.

In [220]:
social_data["geographic_location"] = social_data["geographic_location"].apply(lambda x: "Asia" if "Asia" in x  else
                                                                              "Africa" if "Africa" in x else
                                                                              "Europe" if "Europe" in x  else
                                                                              "North America" if "North America" in x else
                                                                              "Oceania" if "Oceania" in x  else
                                                                              "South America" if "South America" in x else x)

# rename geographic_location column to "continent"
social_data = social_data.rename(columns={"geographic_location": "continent"})
social_data["continent"].head(5)

Index
1        Asia
9        Asia
124    Africa
3        Asia
11       Asia
Name: continent, dtype: object
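
As an alternative to the chained conditional expression, the merge can be sketched with `Series.str.extract`, which pulls out the continent name when one of the listed words appears and leaves the original label otherwise. The labels in `locations` are hypothetical and may not match the values in the real column:

```python
import pandas as pd

# Hypothetical location labels.
locations = pd.Series(
    ["Western Asia", "Eastern Africa", "Northern Europe", "North America", "Oceania"]
)

# extract returns the matched word, or NaN when none of the three words occurs;
# fillna then keeps the original label for those rows.
continent = locations.str.extract(r"(Asia|Africa|Europe)", expand=False).fillna(locations)
print(continent.tolist())
```
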

### C.2.2
Sort the column labels lexicographically. Drop the columns `region` and `contraception`. Print the first 5 observations of the updated DataFrame.

**Hint:** [sort_index()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html)

*(4 points)*

In [221]:
sorted_social_data = social_data.sort_index(axis = 1)
sorted_social_data = sorted_social_data.drop(columns = ["region", "contraception"])
print(sorted_social_data.head(5))

      continent      country  economicActivityFemale  economicActivityMale  \
Index                                                                        
1          Asia        Yemen                     1.9                  80.6   
9          Asia         Oman                    15.8                  84.1   
124      Africa     Ethiopia                    58.4                  84.7   
3          Asia  Afghanistan                     7.2                  87.5   
11         Asia     Maldives                    20.2                  77.3   

       gdpPerCapita  illiteracyFemale  illiteracyMale  infantMortality  \
Index                                                                    
1              1924            69.552          32.406               80   
9             30404             7.310           3.010               25   
124            2973            74.700          54.500              107   
3              2474            85.000          52.800              154   
11       

### C.2.3

Find the percentage of the total countries in each continent.

**Hint:** One line of code with `value_counts()` and `shape`

*(4 points)*

In [222]:
100*(social_data.value_counts("continent")/social_data.shape[0])

continent
Asia             26.451613
Europe           22.580645
Africa           20.000000
North America    13.548387
Oceania           9.677419
South America     7.741935
dtype: float64
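
An equivalent route, sketched on an invented Series, lets `value_counts` do the division itself through its `normalize` argument:

```python
import pandas as pd

# Hypothetical continent labels.
continent = pd.Series(["Asia", "Asia", "Europe", "Africa"], name="continent")

# normalize=True returns proportions; multiply by 100 for percentages.
pct = continent.value_counts(normalize=True) * 100
print(pct)
```
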

### C.2.4

Which country has the highest GDP per capita? Let us call it country $G$.

*(4 points)*

In [223]:
max_gdp = social_data["gdpPerCapita"].max()
country_g = social_data[social_data["gdpPerCapita"] == max_gdp]["country"] 
print(country_g.values[0]) 

Luxembourg


### C.2.5
We need to find the African country that is the closest to country $G$ with regard to social indicators. Perform the following steps:

#### C.2.5.1	
Standardize each of the social indicators to a standard normal distribution so that all of them are on the same scale (remember to exclude *GDP per capita* from social indicators).

**Hint:**

1. For scaling a random variable to standard normal, subtract the mean from each value of the variable, and divide by its standard deviation.

2. Use the apply method with a lambda function to scale all the social indicators to standard normal.

3. The above (1) and (2) together is a single line of code.

*(6 points)*

In [224]:
social_indicators = ["economicActivityFemale", 
    "economicActivityMale", 
    "illiteracyFemale", 
    "illiteracyMale", 
    "infantMortality", 
    "lifeFemale", 
    "lifeMale", 
    "totalfertilityrate"]

normalized_social_data = social_data[social_indicators].apply(lambda x: (x - x.mean()) / x.std())
print(normalized_social_data) 



       economicActivityFemale  economicActivityMale  illiteracyFemale  \
Index                                                                   
1                   -2.608875              0.521297          1.910795   
9                   -1.794841              0.975454         -0.560149   
124                  0.699967              1.053310          2.115165   
3                   -2.298488              1.416635          2.524065   
11                  -1.537162              0.093092         -0.572456   
...                       ...                   ...               ...   
74                   0.049911             -1.100693         -0.810650   
97                   0.284166             -1.567826         -0.833199   
110                  0.448144             -0.659511         -0.842409   
41                  -0.553294             -1.749488         -0.695523   
36                  -0.693846             -1.944127         -0.755071   

       illiteracyMale  infantMortality  lifeFemale
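
A quick sanity check for the scaling step is that every standardized column has mean 0 and standard deviation 1. The sketch below uses invented numbers and relies on the pandas default for `std()`, which is the sample standard deviation (`ddof=1`):

```python
import pandas as pd

# Hypothetical toy indicators.
toy = pd.DataFrame(
    {"lifeFemale": [60.0, 70.0, 80.0], "infantMortality": [10.0, 20.0, 60.0]}
)

# Subtract the column mean and divide by the column standard deviation.
z = toy.apply(lambda col: (col - col.mean()) / col.std())

print(z.mean())
print(z.std())
```
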

#### C.2.5.2
Compute the [Manhattan distance](https://en.wikipedia.org/wiki/Taxicab_geometry) between country $G$ and each of the African countries, based on the scaled social indicators.

**Hint:** 

1. Broadcast a Series to a DataFrame

2. The Manhattan distance between two points $(x_1, x_2, ..., x_p)$ and $(y_1, y_2, ..., y_p)$ is $|x_1 - y_1| + |x_2 - y_2| + ... + |x_p-y_p|$, where $|.|$ stands for absolute value (for example, $|-2| = 2;  |3| = 3$).

*(8 points)*

In [225]:
african_rows = normalized_social_data.loc[social_data['continent'] == "Africa", social_indicators]
g_rows = normalized_social_data.loc[social_data['country'] == country_g.values[0], social_indicators]
manhattan_distances = african_rows[social_indicators].apply(
    lambda x: (x - g_rows.values.flatten()).abs().sum(), axis=1
)
print(manhattan_distances)


Index
124    21.506361
146    22.487953
68     20.935868
151    25.148325
155    24.333913
94     17.169729
148    26.037851
154    23.101754
42     15.264537
88     13.552373
122    19.562837
16     18.630221
18     20.222236
59     15.882509
66     18.348685
57     18.600012
142    18.662018
84     10.853300
73     12.812347
19     14.962067
23      9.734368
55     12.747434
4      10.517160
93      9.088161
51      9.364599
13     10.984499
17     11.300956
12      8.260904
118     6.372256
45      5.051104
56      3.183005
dtype: float64
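
The broadcasting mentioned in the hint can be sketched without `apply`: subtracting a Series from a DataFrame aligns the Series index with the column labels, so the reference row is subtracted from every row. The frame `scaled` and its labels `G`, `P`, and `Q` are hypothetical:

```python
import pandas as pd

# Hypothetical scaled indicators; row "G" is the reference country.
scaled = pd.DataFrame(
    {"x1": [0.0, 1.0, -2.0], "x2": [0.0, -1.0, 3.0]},
    index=["G", "P", "Q"],
)

g_row = scaled.loc["G"]             # Series indexed by column label
others = scaled.drop(index="G")

# |x_1 - y_1| + ... + |x_p - y_p| for every remaining row.
distances = (others - g_row).abs().sum(axis=1)
print(distances)
```
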


#### C.2.5.3
Identify the African country, say country $A$, with the least Manhattan distance to country $G$.

*(8 points)*

In [226]:
min_index = manhattan_distances.argmin()
country_a = social_data.loc[social_data['continent'] == "Africa", "country"].iloc[min_index]
print(country_a)

Reunion
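
Positional lookups such as `argmin()` followed by `iloc` depend on both objects sharing the same row order. A label-based variant, sketched on invented data, uses `idxmin()` so the lookup goes through the index instead:

```python
import pandas as pd

# Hypothetical distances and country names sharing the same index labels.
distances = pd.Series([5.0, 2.0, 7.0], index=[10, 20, 30])
countries = pd.Series(["P", "Q", "R"], index=[10, 20, 30])

closest_label = distances.idxmin()          # index label of the smallest distance
closest_country = countries.loc[closest_label]
print(closest_label, closest_country)
```
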


### C.2.6
Find the correlation between the Manhattan distance from country $G$ and GDP per capita for African countries.

*(6 points)*

In [227]:
african_subset = social_data[social_data["continent"] == "Africa"]
manhattan_distances.corr(african_subset["gdpPerCapita"])


-0.7126967474772727
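
`Series.corr` pairs values by index label before computing the coefficient, which is why the two Series above can be passed in directly. A small sketch with invented values, where the second Series is in a different order and has an extra label:

```python
import pandas as pd

# Hypothetical Series; g is shuffled and contains a label (40) that d lacks.
d = pd.Series([1.0, 2.0, 3.0], index=[10, 20, 30])
g = pd.Series([30.0, 10.0, 20.0, 99.0], index=[30, 10, 20, 40])

# Only the shared labels 10, 20, 30 are used, paired as (1, 10), (2, 20), (3, 30).
r = d.corr(g)
print(r)
```
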

### C.2.7
Based on the correlation coefficient in C.2.6, do you think African countries should try to emulate the social characteristics of country $G$? Justify your answer.

*(4 points)*

There is a strong negative correlation (-0.71) between the Manhattan distance from country $G$ and the GDP per capita of African countries. This means that the closer an African country's standardized social indicators are to those of country $G$ (Luxembourg), the higher its GDP per capita tends to be. Thus, African countries should try to emulate the social characteristics of country $G$, as changing their social characteristics to more closely resemble those of country $G$ would likely increase their GDP per capita.

## C.3 Medical data

Read the data sets *conditions.csv* and *patients.csv*. Suppose we are interested in studying patients with the prediabetes condition. Do not drop or impute any missing values. In *conditions.csv*, the patient IDs are stored in column `PATIENT`, and the medical conditions are stored in column `DESCRIPTION`. In *patients.csv*, the patient IDs are stored in column `Id`.

In [228]:
medical_data = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-1/Data/patients.csv")
conditions_data = pd.read_csv("/Users/vaibhavrangan/Downloads/Stat_303-1/Data/conditions.csv")
conditions_data.columns

Index(['START', 'STOP', 'PATIENT', 'ENCOUNTER', 'CODE', 'DESCRIPTION'], dtype='object')

### C.3.1
Print the patient IDs of all the patients with prediabetes condition.

*(4 points)*

In [229]:
conditions_data[conditions_data["DESCRIPTION"] == "Prediabetes"]["PATIENT"]

40     e9922ddd-8e85-9837-9ac8-885c34075450
112    6d95ba48-a8be-b4c8-cdf6-f3d89142217b
170    2599cde2-f690-8596-3cb5-905d701f6987
182    527c52ef-be00-25b5-823b-b87005bf0d5f
223    f3d4b3f5-5ca8-d376-c3ae-f61c93b1d130
Name: PATIENT, dtype: object

### C.3.2
Make a subset of the data with only prediabetes patients. How many prediabetes patients are there?

*(4 points)*

**Hint:** `.isin()`

In [230]:
# make a subset of medical_data with only those patients with a condition of Prediabetes
prediabetes_patients = medical_data[medical_data["Id"].isin(conditions_data[conditions_data["DESCRIPTION"] == "Prediabetes"]["PATIENT"])]
print(len(prediabetes_patients))

5


### C.3.3
What proportion of the total `HEALTHCARE_EXPENSES` of all the patients corresponds to the `HEALTHCARE_EXPENSES` of prediabetes patients?

*(4 points)*

In [231]:
total_expenses = medical_data["HEALTHCARE_EXPENSES"].sum()
prediabetes_expenses = prediabetes_patients["HEALTHCARE_EXPENSES"].sum()
print(prediabetes_expenses / total_expenses)

0.4609622337553665
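
The subset and the proportion can be sketched together on two small invented tables. Only the column names `PATIENT`, `DESCRIPTION`, `Id`, and `HEALTHCARE_EXPENSES` come from the assignment; the rows are hypothetical. A patient who appears more than once in the conditions table is still counted once, because `isin` only tests membership:

```python
import pandas as pd

# Hypothetical stand-ins for conditions.csv and patients.csv.
conditions = pd.DataFrame({
    "PATIENT": ["p1", "p2", "p1", "p3"],
    "DESCRIPTION": ["Prediabetes", "Anemia", "Prediabetes", "Prediabetes"],
})
patients = pd.DataFrame({
    "Id": ["p1", "p2", "p3", "p4"],
    "HEALTHCARE_EXPENSES": [100.0, 200.0, 300.0, 400.0],
})

prediabetes_ids = conditions.loc[conditions["DESCRIPTION"] == "Prediabetes", "PATIENT"]

# Keep the patient rows whose Id appears among the prediabetes IDs.
subset = patients[patients["Id"].isin(prediabetes_ids)]
n_patients = subset["Id"].nunique()

# Share of all healthcare expenses attributable to the subset.
share = subset["HEALTHCARE_EXPENSES"].sum() / patients["HEALTHCARE_EXPENSES"].sum()
print(n_patients, share)
```
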


## C.4 Bonus question
This is an optional question with no partial credit. You will get points only if your solution is completely correct. We advise you to attempt it only when you are done with the rest of the assignment.

*(10 points)*

Read the file *STAT303-1 survey for data analysis.csv*. In this question, we'll work to clean this data a bit. As with every question, you are **not** allowed to use a `for` loop or any other loop.

Execute the following code to read the data and clean the column names.

In [232]:
survey_data = pd.read_csv('/Users/vaibhavrangan/Downloads/Stat_303-1/Data/STAT303-1 survey for data analysis.csv')
new_col_names = ['parties_per_month', 'smoke', 'weed', 'introvert_extrovert', 'love_first_sight', 'learning_style', 'left_right_brained', 'personality_type', 'social_media', 'num_insta_followers', 'streaming_platforms', 'expected_marriage_age', 'expected_starting_salary', 'fav_sport', 'minutes_ex_per_week', 'sleep_hours_per_day', 'how_happy', 'farthest_distance_travelled', 'fav_number', 'fav_letter', 'internet_hours_per_day', 'only_child', 'birthdate_odd_even', 'birth_month', 'fav_season', 'living_location_on_campus', 'major', 'num_majors_minors', 'high_school_GPA', 'NU_GPA', 'age', 'height', 'height_father', 'height_mother', 'school_year', 'procrastinator', 'num_clubs', 'student_athlete', 'AP_stats', 'used_python_before', 'dominant_hand', 'childhood_in_US', 'gender', 'region_of_residence', 'political_affliation', 'cant_change_math_ability', 'can_change_math_ability', 'math_is_genetic', 'much_effort_is_lack_of_talent']
survey_data.columns = list(survey_data.columns[0:2])+new_col_names

Check the datatype of the variables using the `dtypes` attribute of the Pandas DataFrame object. You will notice that only two variables are numeric. However, if you check the first few observations of the data with the function `head()` you will find that there are several more variables that seem to have numeric values.

In [233]:
survey_data.dtypes
survey_data.head()

Unnamed: 0,Timestamp,fav_alcohol,parties_per_month,smoke,weed,introvert_extrovert,love_first_sight,learning_style,left_right_brained,personality_type,...,used_python_before,dominant_hand,childhood_in_US,gender,region_of_residence,political_affliation,cant_change_math_ability,can_change_math_ability,math_is_genetic,much_effort_is_lack_of_talent
0,2022/09/13 1:43:34 pm GMT-5,I don't drink,1,No,Occasionally,Introvert,No,Visual (learn best through images or graphic o...,"Left-brained (logic, science, critical thinkin...",INFJ,...,Yes,Right,Yes,Female,Northeast,Democrat,Disagree,Agree,Disagree,Disagree
1,2022/09/13 5:28:17 pm GMT-5,Hard liquor/Mixed drink,3,No,Occasionally,Extrovert,No,Visual (learn best through images or graphic o...,"Left-brained (logic, science, critical thinkin...",ESFJ,...,Yes,Right,Yes,Male,West,Democrat,Disagree,Agree,Disagree,Disagree
2,2022/09/13 7:56:38 pm GMT-5,Hard liquor/Mixed drink,3,No,Yes,Introvert,No,Kinesthetic (learn best through figuring out h...,"Left-brained (logic, science, critical thinkin...",ISTJ,...,No,Right,No,Female,International,No affiliation,Disagree,Agree,Disagree,Disagree
3,2022/09/13 10:34:37 pm GMT-5,Hard liquor/Mixed drink,12,No,No,Extrovert,No,Visual (learn best through images or graphic o...,"Left-brained (logic, science, critical thinkin...",ENFJ,...,No,Right,Yes,Female,Southeast,Democrat,Disagree,Agree,Disagree,Disagree
4,2022/09/14 4:46:19 pm GMT-5,I don't drink,1,No,No,Extrovert,Yes,Reading/Writing (learn best through words ofte...,"Right-brained (creative, art, imaginative, int...",ENTJ,...,No,Right,Yes,Female,Northeast,Democrat,Agree,Disagree,Disagree,Disagree


Write a function that accepts a Pandas Series *(or a column of a Pandas DataFrame object)* as argument, and if the datatype of the Series is non-numeric, does the following:

1. Checks if at least 10 values of the Series contain a digit in them.

2. If at least 10 values are found to contain a digit, then:

    A. Eliminate the characters `~`, `+`, and `,` from all the values of the Series.
    
    B. Convert the Series to numeric (with coercion if needed).
    
3. If at least 10 values are NOT found to contain a digit, then:

    A. If the values of the Series are *'Yes'* and *'No'*, then replace *'Yes'* with *1* and *'No'* with *0*. The Series datatype must change to numeric as well.
    
    B. If the values of the Series are *'Agree'* and *'Disagree'*, then replace *'Agree'* with *1* and *'Disagree'* with *0*. The Series datatype must change to numeric as well.
    
Apply the function to each column of `survey_data` using the *apply()* method. Save the updated DataFrame as `survey_data_clean`. Then, execute the following code.

In [234]:
def change_values(df):
    # change all values to string
    df = df.astype(str)
    # count the values made up entirely of digits; note that str.isdigit()
    # does not flag values that merely contain a digit (e.g. "~1,000")
    if len(df[df.str.isdigit()]) >= 10:
        # eliminate the characters ~, +, and , from all values of the Series
        df = df.str.replace("~", "", regex=False)
        df = df.str.replace("+", "", regex=False)
        df = df.str.replace(",", "", regex=False)
        # convert the Series to numeric, coercing unparseable values to NaN
        df = df.apply(pd.to_numeric, errors='coerce')
    else:
        # check if values of series are "Yes" and "No"
        if len(df[df == "Yes"]) >= 1 and len(df[df == "No"]) >= 1:
            df = df.apply(lambda x: 1 if x == "Yes" else 0)
            df = pd.to_numeric(df)
        elif len(df[df == "Agree"]) >= 1 and len(df[df == "Disagree"]) >= 1:
            df = df.apply(lambda x: 1 if x == "Agree" else 0)
            df = pd.to_numeric(df)
    # return numeric columns of df
    return df

survey_data_clean = survey_data.apply(change_values)
survey_data_clean.describe().loc['mean',:]



parties_per_month                    5.052632
smoke                                0.036458
weed                                 0.166667
love_first_sight                     0.281250
num_insta_followers                969.211957
expected_marriage_age               29.185792
expected_starting_salary         85393.478723
minutes_ex_per_week                205.778378
sleep_hours_per_day                  7.223958
farthest_distance_travelled       6283.676271
fav_number                          37.978947
internet_hours_per_day              11.196809
only_child                           0.151042
num_majors_minors                    2.505263
high_school_GPA                      6.704000
height                              67.660495
height_father                       69.073518
height_mother                       63.377369
procrastinator                       0.640625
num_clubs                            2.593750
student_athlete                      0.015625
AP_stats                          

The above code should print out the mean values of 28 numeric columns in the data `survey_data_clean`.

The variables that you should see as numeric in `survey_data_clean` are given in the list `numeric_columns` below *(this is just to check your work)*:

In [235]:
numeric_columns = ['parties_per_month', 'love_first_sight', 'num_insta_followers',
       'expected_marriage_age', 'expected_starting_salary',
       'minutes_ex_per_week', 'sleep_hours_per_day',
       'farthest_distance_travelled', 'fav_number', 'internet_hours_per_day',
       'only_child', 'num_majors_minors', 'high_school_GPA', 'NU_GPA', 'age',
       'height', 'height_father', 'height_mother', 'procrastinator',
       'num_clubs', 'student_athlete', 'AP_stats', 'used_python_before',
       'childhood_in_US', 'cant_change_math_ability',
       'can_change_math_ability', 'math_is_genetic',
       'much_effort_is_lack_of_talent']

Note that your **function must be general**, i.e., it must work for any other dataset as well. This means, you cannot hard code a column name, or anything specific to `survey_data` in the function.
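
One possible reading of the steps listed earlier can be sketched as follows. This is an illustration on invented data, not the graded solution: the function name `clean_column`, the `toy` frame, and the choice to treat "the values are 'Yes' and 'No'" as "the set of non-missing values is exactly those two" are all assumptions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def clean_column(col: pd.Series) -> pd.Series:
    """Sketch of a general column cleaner following the listed steps."""
    if is_numeric_dtype(col):
        return col  # already numeric: leave untouched

    text = col.astype(str)
    # Step 1: count the values that contain at least one digit anywhere.
    has_digit = text.str.contains(r"\d", regex=True, na=False)

    if has_digit.sum() >= 10:
        # Step 2: strip ~, + and , then convert, coercing failures to NaN.
        stripped = text.str.replace(r"[~+,]", "", regex=True)
        return pd.to_numeric(stripped, errors="coerce")

    # Step 3: recode only when the observed values are exactly the expected pair.
    observed = set(col.dropna().unique())
    if observed == {"Yes", "No"}:
        return col.map({"Yes": 1, "No": 0})
    if observed == {"Agree", "Disagree"}:
        return col.map({"Agree": 1, "Disagree": 0})
    return col


# Hypothetical toy data exercising each branch.
toy = pd.DataFrame({
    "followers": ["~1,000", "250+", "30", "40", "50", "60", "70", "80", "90", "100", "none"],
    "only_child": ["Yes", "No"] * 5 + ["No"],
    "weed": ["Yes", "No", "Occasionally"] * 3 + ["No", "Yes"],
    "fav_letter": list("abcdefghijk"),
})

cleaned = toy.apply(clean_column)
print(cleaned.dtypes)
```

Under this reading, a column such as the toy `weed`, which has a third value besides *'Yes'* and *'No'*, is left as text, and nothing in the function refers to a specific column name.
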