# INTRO
This project investigates whether there is a relationship between years of experience and salary, and compares the proportions of respondents who use Python, R, and SQL. We use Python and data obtained from Kaggle in 2021 to conduct this analysis.

Let us load the data from the Kaggle CSV file.

In [5]:
import csv

# newline="" is what the csv module documentation recommends for file objects
with open('kaggle2021-short.csv', newline="") as f:
    reader = csv.reader(f, delimiter=",")
    kaggle_data = list(reader)

column_names = kaggle_data[0]
survey_responses = kaggle_data[1:]

for survey in survey_responses:
    # years_experience: convert to float
    survey[0] = float(survey[0])
    # Python, R, SQL: "TRUE" becomes True, anything else becomes False
    for i in (1, 2, 3):
        survey[i] = survey[i] == "TRUE"
    # most_used: convert the string "NONE" to None
    if survey[4] == "NONE":
        survey[4] = None
    # salary: convert to int
    survey[-1] = int(survey[-1])
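
The conversion loop above can also be written as a small helper that parses one row at a time, which makes it easy to check on a few rows before loading the full file. The following is a minimal sketch, assuming the column order used above (years of experience, Python, R, SQL, most used language, salary). The `parse_row` name, the header names, and the sample rows are hypothetical and are not taken from the Kaggle file.

```python
import csv
import io


def parse_row(row):
    """Convert one raw CSV row (a list of strings) into typed values.

    Assumes the column order: years_experience, python, r, sql,
    most_used, salary (an assumption based on the loop above).
    """
    years = float(row[0])
    uses_python, uses_r, uses_sql = (value == "TRUE" for value in row[1:4])
    most_used = None if row[4] == "NONE" else row[4]
    salary = int(row[5])
    return [years, uses_python, uses_r, uses_sql, most_used, salary]


# Hypothetical sample data standing in for kaggle2021-short.csv
sample_csv = io.StringIO(
    "years_experience,python,r,sql,most_used,salary\n"
    "3.0,TRUE,FALSE,TRUE,Python,50000\n"
    "12.5,FALSE,TRUE,FALSE,NONE,82000\n"
)
reader = csv.reader(sample_csv)
header = next(reader)  # skip the header row
parsed_rows = [parse_row(row) for row in reader]
print(parsed_rows)
```

Keeping the parsing in one function means a malformed value raises an error at a single, easy-to-find place.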

## Proportions
We will begin by comparing the proportions of users who use Python, R, and SQL.

In [6]:
count_python = 0
count_r = 0
count_sql = 0

for survey in survey_responses:
    # Count Python (the values are already booleans)
    if survey[1]:
        count_python += 1
    # Count R
    if survey[2]:
        count_r += 1
    # Count SQL
    if survey[3]:
        count_sql += 1

proportion_python = count_python / len(survey_responses)
proportion_r = count_r / len(survey_responses)
proportion_sql = count_sql / len(survey_responses)

print("Number of Python users: " + str(count_python))
print("The proportion of Python users is: " + str(proportion_python * 100) + "%")
print("Number of R users: " + str(count_r))
print("The proportion of R users is: " + str(proportion_r * 100) + "%")
print("Number of SQL users: " + str(count_sql))
print("The proportion of SQL users is: " + str(proportion_sql * 100) + "%")

Number of Python users: 21860
The proportion of Python users is: 84.16432449081739%
Number of R users: 5335
The proportion of R users is: 20.540561352173413%
Number of SQL users: 10757
The proportion of SQL users is: 41.41608593539445%
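
The three counters above follow the same pattern, so the calculation can be factored into one function. This is a minimal sketch under the assumption that each row stores booleans at indices 1 to 3; the `proportion` helper and the `toy_rows` data are hypothetical.

```python
def proportion(rows, column_index):
    """Return the fraction of rows whose value at column_index is True."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[column_index]) / len(rows)


# Hypothetical rows: [years_experience, python, r, sql]
toy_rows = [
    [3.0, True, False, True],
    [1.0, True, True, False],
    [8.0, False, False, True],
    [2.0, True, False, False],
]

for name, index in [("Python", 1), ("R", 2), ("SQL", 3)]:
    print(f"{name}: {proportion(toy_rows, index):.1%}")
```

Because a respondent can use more than one language, the proportions need not sum to 100%, which is also the case in the output above.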


Let us now look at the relationship between years of experience and compensation. We expect compensation to increase with years of experience.

In [16]:
#Lists to append experience and salary
experience_coding = []
compensation = []

#Adding data to lists
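for survey in survey_responses:
    experience_coding.append(survey[0])
    # Salary is at index 5; survey[-1] stops being the salary
    # once the grouping cell appends a label to each row
    compensation.append(survey[5])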

min_years_experience = min(experience_coding)
max_years_experience = max(experience_coding)
avg_years_experience = sum(experience_coding) / len(experience_coding)
print("The minimum experience is: " + str(min_years_experience) + " The maximum experience is: " + str(max_years_experience) + " The Average experience is: " + str(avg_years_experience))

min_compensation = min(compensation)
max_compensation = max(compensation)
avg_compensation = sum(compensation) / len(compensation)
print("The minimum compensation is: " + str(min_compensation) + " The maximum compensation is: " + str(max_compensation) + " The average compensation is: " + str(avg_compensation))


The minimum experience is: 0.0 The maximum experience is: 30.0 The Average experience is: 5.297231740653729
The minimum compensation is: 0 The maximum compensation is: 1492951 The average compensation is: 53252.81696377007
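
Minimum, maximum, and mean describe each variable separately and do not measure how the two move together. The mean is also sensitive to extreme values such as the maximum compensation reported above. As a hedged sketch of one possible next step, the code below computes a Pearson correlation coefficient and a median on toy data; the `pearson_r` helper and the toy lists are hypothetical and are not results from the Kaggle file.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical values, for illustration only
toy_experience = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_salary = [30000, 35000, 50000, 52000, 70000]

print("Correlation:", pearson_r(toy_experience, toy_salary))
print("Median salary:", statistics.median(toy_salary))
```

A coefficient near 1 would indicate that salary tends to rise with experience, a value near 0 would indicate no linear relationship. The helper divides by zero if either list is constant, so it assumes both variables vary.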


### GROUPING
We will now group respondents into five-year experience bands for further analysis.

In [8]:
for survey in survey_responses:
    # Each branch is reached only when the earlier checks failed,
    # so only the upper bound needs to be tested.
    if survey[0] < 5:
        survey.append("Less than 5 years")
    elif survey[0] < 10:
        survey.append("Between 5 and 10 years")
    elif survey[0] < 15:
        survey.append("Between 10 and 15 years")
    elif survey[0] < 20:
        survey.append("Between 15 and 20 years")
    elif survey[0] < 25:
        survey.append("Between 20 and 25 years")
    else:
        survey.append("25+ years")
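
The same banding can be expressed as a lookup against a sorted list of boundaries, which keeps the edges and labels in one place. This is a minimal sketch using `bisect_right` from the standard library; the names `BAND_EDGES`, `BAND_LABELS`, and `experience_band` are hypothetical.

```python
from bisect import bisect_right

# Lower edges of every band after the first; each lower edge is inclusive
BAND_EDGES = [5, 10, 15, 20, 25]
BAND_LABELS = [
    "Less than 5 years",
    "Between 5 and 10 years",
    "Between 10 and 15 years",
    "Between 15 and 20 years",
    "Between 20 and 25 years",
    "25+ years",
]


def experience_band(years):
    """Return the label of the five-year band that contains years."""
    return BAND_LABELS[bisect_right(BAND_EDGES, years)]


for years in (0.0, 4.9, 5.0, 24.9, 25.0, 30.0):
    print(years, "->", experience_band(years))
```

`bisect_right` counts how many edges are less than or equal to `years`, so a value of exactly 5 falls into "Between 5 and 10 years", matching the `if`/`elif` chain above.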


Let us compare the different experience groups.

In [22]:
less_than_five = []
less_than_five_sum = 0
five_between_ten = []
five_between_ten_sum = 0
ten_between_fifteen = []
ten_between_fifteen_sum = 0
fifteen_between_twenty = []
fifteen_between_twenty_sum = 0
twenty_between_twenty_five = []
twenty_between_twenty_five_sum = 0
twenty_five_plus = []
twenty_five_plus_sum = 0

for survey in survey_responses:
    if survey[-1] == "Less than 5 years":
        less_than_five.append("Less than 5 years")
        less_than_five_sum += survey[5]
    elif survey[-1] == "Between 5 and 10 years":
        five_between_ten.append("Between 5 and 10 years")
        five_between_ten_sum += survey[5]
    elif survey[-1] == "Between 10 and 15 years":
        ten_between_fifteen.append("Between 10 and 15 years")
        ten_between_fifteen_sum += survey[5]
    elif survey[-1] == "Between 15 and 20 years":
        fifteen_between_twenty.append("Between 15 and 20 years")
        fifteen_between_twenty_sum += survey[5]
    elif survey[-1] == "Between 20 and 25 years":
        twenty_between_twenty_five.append("Between 20 and 25 years")
        twenty_between_twenty_five_sum += survey[5]
    else:
        twenty_five_plus.append("25+ years")
        twenty_five_plus_sum += survey[5]
        
print(str(len(less_than_five)) + " people less than 5 years")
print(str(len(five_between_ten)) + " people between 5 - 10 years")
print(str(len(ten_between_fifteen)) + " people between 10 and 15 years")
print(str(len(fifteen_between_twenty)) + " people between 15 and 20 years")
print(str(len(twenty_between_twenty_five)) + " people between 20 and 25 years")
print(str(len(twenty_five_plus)) + " people with more than 25 years")
#Calculate averages
below_five_avg = less_than_five_sum / len(less_than_five)
between_five_and_ten_avg = five_between_ten_sum / len(five_between_ten)
between_ten_and_fifteen_avg = ten_between_fifteen_sum / len(ten_between_fifteen)
between_fifteen_and_twenty_avg = fifteen_between_twenty_sum / len(fifteen_between_twenty)
between_twenty_and_twenty_five_avg = twenty_between_twenty_five_sum / len(twenty_between_twenty_five)
twenty_five_plus_avg = twenty_five_plus_sum / len(twenty_five_plus)
#Print out averages
print("People with less than 5 years experience have an average salary of $" + str(below_five_avg))
print("People between 5 and 10 years experience have an average salary of $" + str(between_five_and_ten_avg))
print("People between 10 and 15 years experience have an average salary of $" + str(between_ten_and_fifteen_avg))
print("People between 15 and 20 years experience have an average salary of $" + str(between_fifteen_and_twenty_avg))
print("People between 20 and 25 years experience have an average salary of $" + str(between_twenty_and_twenty_five_avg))
print("People with more than 25 years experience have an average salary of $" + str(twenty_five_plus_avg))

18753 people less than 5 years
3167 people between 5 - 10 years
1118 people between 10 and 15 years
1069 people between 15 and 20 years
925 people between 20 and 25 years
941 people with more than 25 years
People with less than 5 years experience have an average salary of $45047.87484669119
People between 5 and 10 years experience have an average salary of $59312.82033470161
People between 10 and 15 years experience have an average salary of $80226.75581395348
People between 15 and 20 years experience have an average salary of $75101.82694106642
People between 20 and 25 years experience have an average salary of $103159.80432432433
People with more than 25 years experience have an average salary of $90444.98512221042
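
The per-group lists and running sums above can be collapsed into a single dictionary keyed by group label. The following is a sketch under the assumption that each row holds the salary at index 5 and the group label as its last element, as in the cell above; `mean_salary_by_group` and `toy_rows` are hypothetical.

```python
from collections import defaultdict


def mean_salary_by_group(rows):
    """Map each group label to a (count, mean salary) tuple.

    Assumes the salary is at index 5 and the label is the last element.
    """
    salaries = defaultdict(list)
    for row in rows:
        salaries[row[-1]].append(row[5])
    return {
        label: (len(values), sum(values) / len(values))
        for label, values in salaries.items()
    }


# Hypothetical rows in the same layout as survey_responses after grouping
toy_rows = [
    [2.0, True, False, True, "Python", 40000, "Less than 5 years"],
    [4.0, True, False, False, "Python", 50000, "Less than 5 years"],
    [7.0, False, True, True, "R", 60000, "Between 5 and 10 years"],
]

for label, (count, mean) in mean_salary_by_group(toy_rows).items():
    print(f"{label}: {count} people, average salary ${mean:,.2f}")
```

Only labels that actually occur become keys, so an empty group cannot cause a division by zero.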


## SUMMARY
Looking at our analysis, we notice that the largest group by far is respondents with less than 5 years of experience (18,753 of 25,973 responses). The average for that group therefore rests on many more responses and is likely more reliable than the averages for the more experienced groups. Comparing average salary across the categories, salary generally increases with experience, although not at every step: the 15 to 20 year group averages less than the 10 to 15 year group, and the 25+ group averages less than the 20 to 25 year group. Because neighbouring five-year groups have similar averages, we could rerun the analysis with ten-year groups and would expect to see a similar overall trend.
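
To try the ten-year grouping suggested above, the band width can be made a parameter. This is a minimal sketch on hypothetical data; `band_start`, `mean_salary_by_width`, and `toy_pairs` are illustrative names. Unlike the grouping earlier in this notebook, it does not merge everything above 25 years into one band.

```python
def band_start(years, width):
    """Lower edge of the band of the given width that contains years."""
    return int(years // width) * width


def mean_salary_by_width(pairs, width):
    """Map each band's lower edge to the mean salary within that band.

    pairs is an iterable of (years_experience, salary) tuples.
    """
    totals = {}
    for years, salary in pairs:
        start = band_start(years, width)
        count, total = totals.get(start, (0, 0))
        totals[start] = (count + 1, total + salary)
    return {start: total / count for start, (count, total) in sorted(totals.items())}


# Hypothetical (years_experience, salary) pairs, for illustration only
toy_pairs = [(1.0, 40000), (6.0, 60000), (12.0, 80000), (18.0, 70000)]

print(mean_salary_by_width(toy_pairs, 5))
print(mean_salary_by_width(toy_pairs, 10))
```

Running both widths on the same data shows how much of the band-to-band variation is smoothed out by wider bands.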