# An exploration of the cities I would find desirable to live in

This is a simple beginners' project to practice using the pandas and seaborn libraries. The objective is to narrow down a list of cities according to my personal preferences and find where I would most like to live. If you have any suggestions for improvements, they would be much appreciated!

In [None]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Import the "quality of life" data and clean the country column by stripping leading whitespace.

In [None]:
quality_of_life_filepath = "../input/city-quality-of-life-dataset/uaScoresDataFrame.csv"
quality_of_life_data = pd.read_csv(quality_of_life_filepath,index_col=0)
quality_of_life_data['UA_Country'] = quality_of_life_data['UA_Country'].str.lstrip()

In [None]:
quality_of_life_data.head()

Plot of the "cost of living" score for every city, split by continent. All the columns in the "quality of life" data range from 0.0 to 10.0, with larger numbers indicating higher desirability, so a high "cost of living" score means a cheaper city. Cities in Europe, North America and Asia cover a wide range of scores, whereas Oceania is mid-range and South America and Africa are on average cheaper.

In [None]:
plt.figure(figsize=(10,6))
sns.swarmplot(x=quality_of_life_data["UA_Continent"],y=quality_of_life_data["Cost of Living"], s=6)

I calculated a score "Score_1" by weighting different columns from the "quality of life" data on how important each variable is to me. 

In [None]:
quality_of_life_data['Score_1'] = quality_of_life_data['Housing'] + quality_of_life_data['Cost of Living'] + (quality_of_life_data['Travel Connectivity']*0.9) + (quality_of_life_data['Safety']*0.9) + (quality_of_life_data['Healthcare']*0.8) + (quality_of_life_data['Education']*0.5) + (quality_of_life_data['Environmental Quality']*0.8) + (quality_of_life_data['Internet Access']*0.8) + (quality_of_life_data['Economy']*0.5) + (quality_of_life_data['Taxation']*0.5)+ (quality_of_life_data['Leisure & Culture']*2)+ (quality_of_life_data['Tolerance']*3) 
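
The weighted sum above can also be written with the weights held in one place, which makes them easier to read and adjust. This is only a sketch: the weights are copied from the cell above, while the names `weights` and `toy` and the toy values are made up for illustration.

```python
import pandas as pd

# Weights copied from the Score_1 cell; the dictionary name is my own choice.
weights = {
    'Housing': 1.0, 'Cost of Living': 1.0, 'Travel Connectivity': 0.9,
    'Safety': 0.9, 'Healthcare': 0.8, 'Education': 0.5,
    'Environmental Quality': 0.8, 'Internet Access': 0.8, 'Economy': 0.5,
    'Taxation': 0.5, 'Leisure & Culture': 2.0, 'Tolerance': 3.0,
}

# Tiny invented table standing in for quality_of_life_data.
toy = pd.DataFrame({col: [5.0, 10.0] for col in weights},
                   index=['City A', 'City B'])

# Multiply each column by its weight, then sum across the columns of each row.
w = pd.Series(weights)
toy['Score_1'] = toy[w.index].mul(w, axis='columns').sum(axis='columns')
print(toy['Score_1'])
```

Since every column ranges from 0 to 10 and the weights add up to 12.7, the highest possible Score_1 is 127.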

Plotting the distribution of scores across the continents, I can see that cities in Europe score higher on average than those on other continents and are more numerous. This is lucky as, being European, I would prefer to stay within Europe.

In [None]:
plt.figure(figsize=(10,6))
sns.swarmplot(x=quality_of_life_data["UA_Continent"],y=quality_of_life_data["Score_1"], s=6)

I limit the "quality of life" data down to Europe.

In [None]:
# .copy() makes an independent DataFrame, so new columns can be added to it safely later on
Europe_quality = quality_of_life_data.loc[quality_of_life_data.UA_Continent == 'Europe'].copy()
Europe_quality.head()

In [None]:
Europe_quality_scored=Europe_quality.sort_values(by=['Score_1'],ascending=False)
Europe_quality_scored[0:10]

I would like to live in a reasonably warm climate, so I import some data on the temperature of different cities and separate out the European cities.

In [None]:
world_temp_filepath = '../input/world-average-temperature/Avg_World_Temp_2020.csv'
world_temp_data = pd.read_csv(world_temp_filepath,index_col = 0)

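Europe_temp = world_temp_data.loc[world_temp_data.Continent == "Europe"]
# replace() returns a new DataFrame, so the result has to be assigned back
Europe_temp = Europe_temp.replace(['North Macedonia'],['Macedonia'])
Europe_temp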

I take the mean of the two hottest months of the year, July and August, to be the summer temperature and the mean of the two coldest, January and February, to be the winter temperature.

I plot the average summer temperature against the average winter temperature for these European cities and find a positive correlation: cities with hotter summers tend to also have warmer winters.

In [None]:
summer_av = Europe_temp.loc[:,['Jul','Aug']].mean(axis=1).to_frame(name="summer_av") 
winter_av = Europe_temp.loc[:,['Jan','Feb']].mean(axis=1).to_frame(name="winter_av") 

#Europe_temp
Europe_temp_score = Europe_temp.iloc[:,[0,1]]
Europe_temp_score = Europe_temp_score.join([summer_av,winter_av])
#Europe_temp_score

sns.scatterplot(x= Europe_temp_score['winter_av'], y= Europe_temp_score['summer_av'])
sns.regplot(x= Europe_temp_score['winter_av'], y= Europe_temp_score['summer_av'])
plt.title("Average summer and winter temperatures of European cities (\xb0C)")
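
To put a number on the correlation seen in the plot, the Pearson coefficient can be computed with `Series.corr`. This is a minimal sketch on invented temperatures; `toy_temp` is a hypothetical stand-in for `Europe_temp_score`.

```python
import pandas as pd

# Invented seasonal averages for four cities, in degrees Celsius.
toy_temp = pd.DataFrame({
    'winter_av': [-5.0, 0.0, 3.0, 8.0],
    'summer_av': [17.0, 19.0, 22.0, 26.0],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = toy_temp['winter_av'].corr(toy_temp['summer_av'])
print(f"Pearson r = {r:.2f}")
```

A value close to 1 would back up the visual impression of a positive relationship; a value near 0 would suggest the trend line is not telling us much.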

In [None]:
#Europe_temp['summer'] = Europe_temp.loc[:,['Jul','Aug']].mean(axis=1)
def summerscalefunc(row): 
    if row.summer_av > 27:
        return 4
    elif row.summer_av > 25:
        return 7
    elif row.summer_av > 21:
        return 10
    elif row.summer_av > 18:
        return 7
    elif row.summer_av > 15:
        return 5
    else:
        return 0
        
def winterscalefunc(row): 
    if row.winter_av <= -10:
        return 0
    elif row.winter_av <= -5:
        return 1
    elif row.winter_av <= 0:
        return 2
    elif row.winter_av <= 3:
        return 10
    elif row.winter_av <= 5:
        return 8
    else:
        return 6
summer_score = Europe_temp_score.apply(summerscalefunc, axis='columns').to_frame(name="summer_score")
winter_score = Europe_temp_score.apply(winterscalefunc, axis='columns').to_frame(name="winter_score")

Europe_temp_score = Europe_temp_score.join([summer_score,winter_score])
#Europe_temp_score
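
The two scoring functions above are step functions, so they could also be expressed as a lookup over bin edges. The sketch below uses `np.digitize` with the same thresholds and scores; the helper name `step_score` and the toy temperatures are my own. It assumes there are no missing temperatures, because a NaN would land in the top bin here rather than in the `else` branch.

```python
import numpy as np
import pandas as pd

# Thresholds and scores copied from summerscalefunc and winterscalefunc.
summer_edges = [15, 18, 21, 25, 27]
summer_scores = np.array([0, 5, 7, 10, 7, 4])
winter_edges = [-10, -5, 0, 3, 5]
winter_scores = np.array([0, 1, 2, 10, 8, 6])

def step_score(values, edges, scores):
    """Look up a score for each value; bin i covers edges[i-1] < value <= edges[i]."""
    idx = np.digitize(values.to_numpy(), edges, right=True)
    return scores[idx]

# Invented temperatures chosen to hit different bins, including boundary values.
toy = pd.DataFrame({
    'summer_av': [14.0, 20.0, 23.0, 27.0, 30.0],
    'winter_av': [-12.0, -5.0, 2.0, 4.0, 9.0],
})
toy['summer_score'] = step_score(toy['summer_av'], summer_edges, summer_scores)
toy['winter_score'] = step_score(toy['winter_av'], winter_edges, winter_scores)
print(toy)
```

Keeping the edges and scores in lists makes it easy to see the whole scale at a glance and to tweak a single threshold.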

In [None]:
Europe_temp_score.loc[Europe_temp_score.Country == 'United Kingdom']

I grouped the data by country, found the mean summer and winter temperature scores for each country, and stored them in dictionaries. I used these dictionaries to add the temperature scores to the rest of my data in the Europe_quality dataframe.

In [None]:
Country_T_summer = Europe_temp_score.groupby(['Country']).summer_score.mean().to_dict()
Country_T_winter = Europe_temp_score.groupby(['Country']).winter_score.mean().to_dict()

Europe_quality['summer_temp'] = Europe_quality['UA_Country'].map(Country_T_summer)
Europe_quality['winter_temp'] = Europe_quality['UA_Country'].map(Country_T_winter)
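
One thing to watch with this group, dictionary, map pattern is that any country whose name is spelled differently in the two datasets silently ends up as NaN. Here is a small sketch of the pattern with a check for unmatched names; the toy frames and country values are invented.

```python
import pandas as pd

# Invented city-level scores standing in for Europe_temp_score.
toy_temp = pd.DataFrame({
    'Country': ['Spain', 'Spain', 'France'],
    'summer_score': [10, 7, 7],
})
# Invented cities standing in for Europe_quality.
toy_cities = pd.DataFrame({'UA_Country': ['Spain', 'France', 'Iceland']})

# Mean score per country, as a plain dictionary.
country_summer = toy_temp.groupby('Country')['summer_score'].mean().to_dict()

# map() looks each country up in the dictionary; missing keys become NaN.
toy_cities['summer_temp'] = toy_cities['UA_Country'].map(country_summer)

# List the countries that found no match, so the names can be reconciled.
unmatched = toy_cities.loc[toy_cities['summer_temp'].isna(), 'UA_Country'].unique()
print(unmatched)
```

Any row left as NaN will also give a NaN total when the scores are summed later, so it drops out of the ranking without warning.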


I imported some UNESCO data as this is a good measure of cultural and natural sights in a country.

In [None]:
unesco_filepath = "../input/unesco-world-heritage-sites/whc-sites-2019.csv"
unesco_data = pd.read_csv(unesco_filepath,index_col="id_no")

I split the country column so that when a UNESCO site is shared by multiple countries, each country has its own entry. Then I counted the number of sites for each country and saved the counts as a dictionary, which I used to add this data to the Europe_quality dataframe.

In [None]:
unesco_data_split = unesco_data.assign(states_name_en = unesco_data.states_name_en.str.split(',')).explode('states_name_en')
unesco_europe_NA_split = unesco_data_split.loc[unesco_data_split.region_en == "Europe and North America"]
unesco_europe_NA_split = unesco_europe_NA_split.replace(['United Kingdom of Great Britain and Northern Ireland','North Macedonia','Russian Federation','Republic of Moldova'],['United Kingdom','Macedonia','Russia','Moldova'])

unesco_counts = unesco_europe_NA_split.states_name_en.value_counts()
unesco_europe_D = unesco_counts.to_dict()

Europe_quality['unesco'] = Europe_quality['UA_Country'].map(unesco_europe_D)
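
The split, explode and count steps are easier to follow on a tiny example. The site list below is invented, and the extra `str.strip()` is my own precaution in case some names are separated by a comma followed by a space.

```python
import pandas as pd

# Invented sites: one single-country site and two shared sites.
toy_sites = pd.DataFrame({
    'states_name_en': ['Italy', 'France, Spain', 'Italy,Switzerland'],
})

# Split each entry into a list, give every country its own row, then trim spaces.
countries = toy_sites['states_name_en'].str.split(',').explode().str.strip()

# Count how many sites each country appears in.
site_counts = countries.value_counts()
print(site_counts.to_dict())
```

Without the strip, " Spain" and "Spain" would be counted as two different countries.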

I scaled the UNESCO counts to a score out of 10 by dividing each count by 55, which I take as the highest count, so the country with the most sites scores 10.

In [None]:
Europe_quality['unesco_scaled'] = Europe_quality.unesco.map(lambda u: u/55 * 10)
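
A possible variation is to compute the maximum from the data instead of typing it in, so the scale stays correct if the dataset changes. A minimal sketch, with invented counts standing in for the `unesco` column:

```python
import pandas as pd

# Invented site counts for three countries.
toy_counts = pd.Series([55, 30, 11], index=['Country A', 'Country B', 'Country C'])

# Divide by the largest count so that the top country scores exactly 10.
toy_scaled = toy_counts / toy_counts.max() * 10
print(toy_scaled.round(2))
```

The arithmetic is the same as in the cell above; only the source of the divisor differs.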

I added new parameters to Score_1 to create Score_2.

In [None]:
Europe_quality['Score_2'] = Europe_quality['Score_1'] + Europe_quality['unesco_scaled'] + Europe_quality['summer_temp'] + Europe_quality['winter_temp'] 
Europe_quality_scored = Europe_quality.sort_values(by=['Score_2'],ascending=False)
Europe_quality_scored[0:10]

Salaries also vary between countries, so I imported some data on average salaries in the EU.

In [None]:
salaries_filepath = "../input/average-eu-salaries/EU_av_salaries.csv"
salaries = pd.read_csv(salaries_filepath,encoding='latin1')

salaries.rename(columns={'Gross Salary euro': 'Gross', 'Net Salary euro': 'Net'}, inplace=True)

Plot the distribution of average national gross and net salaries for European countries. You can see that the net distribution is narrower than the gross one, which shows that the tax applied differs between countries. So I will focus on the gross salary, since I already have a tax variable in Score_1.

In [None]:
sns.kdeplot(data = salaries['Gross'],shade = True, label = 'Gross')
sns.kdeplot(data = salaries['Net'],shade = True, label = 'Net')
plt.title("Distribution of average gross and net salaries of European countries")
plt.legend()
plt.xlabel("Average salary") 

I create a dictionary of average monthly gross salaries by country and add this data to the Europe_quality dataframe.  

In [None]:
monthly_gross = pd.Series(salaries.Gross.values, salaries.Country).to_dict()

Europe_quality['salary'] = Europe_quality['UA_Country'].map(monthly_gross)

I plot salary against Score_2. There is a slight positive correlation between the happiness score and the salary (money does buy happiness to some extent), but the points are widely scattered, so the relationship is uncertain.

In [None]:
sns.scatterplot(x= Europe_quality['Score_2'], y= Europe_quality['salary']) 
plt.title('Average monthly gross salary against my happiness score of European countries')

Create a dataframe of desirable cities by keeping those that score above 100 on the happiness score and are in a country with an average monthly gross salary of at least 2400 euros.

In [None]:
desirable_cities = Europe_quality.loc[(Europe_quality.Score_2 > 100) & (Europe_quality.salary >= 2400)].sort_values(by = ['Score_2'], ascending = False).reset_index()
desirable_cities[0:10]

I notice there are still some cities that are too cold for my liking in the list of the most desirable, so I create a variable defining whether a country has a good climate based on the scores it attained for summer and winter temperature. Warm in the summer but still with a proper winter, though not too much sub-zero!

In [None]:
Europe_quality['good climate'] = (Europe_quality['summer_temp'] > 5) & (Europe_quality['winter_temp'] >2)

The final list of my most desirable cities is given here!

In [None]:
desirable_cities_good_climate = Europe_quality.loc[(Europe_quality.Score_2 >= 100) & (Europe_quality.salary >= 2400) & Europe_quality['good climate']].sort_values(by = ['Score_2'], ascending = False).reset_index()
desirable_cities_good_climate[0:10]