# Kaggle Data Science Survey
Guided project by Dataquest. In this project we will:
 - Clean the dataset
 - Analyze the dataset
 - Learn about the relationship between years of experience and salary in the dataset
 

## Loading and Cleaning the Data

In [1]:
import csv

# newline='' is what the csv module documentation recommends for file objects
with open('kaggle2021-short.csv', newline='') as f:
    reader = csv.reader(f, delimiter=",")
    kaggle_data = list(reader)

column_names = kaggle_data[0]
survey_responses = kaggle_data[1:]

print(column_names)
print(survey_responses[0:2])


['experience_coding', 'python_user', 'r_user', 'sql_user', 'most_used', 'compensation']
[['6.1', 'TRUE', 'FALSE', 'TRUE', 'Scikit-learn', '124267'], ['12.3', 'TRUE', 'TRUE', 'TRUE', 'Scikit-learn', '236889']]
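
As an alternative to positional indices such as `response[0]`, the standard library's `csv.DictReader` maps each row to its column names. The sketch below is only an illustration: it reads two rows copied from the output above out of an in-memory string (`sample_csv` is a name made up for this example) rather than from `kaggle2021-short.csv`.

```python
import csv
import io

# Two sample rows copied from the output above; the notebook itself reads
# them from 'kaggle2021-short.csv' instead of an in-memory string.
sample_csv = (
    "experience_coding,python_user,r_user,sql_user,most_used,compensation\n"
    "6.1,TRUE,FALSE,TRUE,Scikit-learn,124267\n"
    "12.3,TRUE,TRUE,TRUE,Scikit-learn,236889\n"
)

# DictReader uses the first line as the header and yields one dict per row.
reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)

# Values are still strings at this point and need the same cleaning step.
first = rows[0]
print(first["experience_coding"], first["compensation"])
```

Accessing `first["compensation"]` instead of `response[5]` makes the later cells easier to read, at the cost of changing every index used in the rest of the notebook.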


In [2]:
for response in survey_responses:
    # experience_coding: convert to float
    response[0] = float(response[0])

    # python_user, r_user, sql_user: "TRUE"/"FALSE" strings to booleans
    for i in (1, 2, 3):
        response[i] = response[i] == "TRUE"

    # most_used: the string "None" marks a missing value
    if response[4] == "None":
        response[4] = None

    # compensation: convert to int
    response[5] = int(response[5])

print(survey_responses[0:2])


[[6.1, True, False, True, 'Scikit-learn', 124267], [12.3, True, True, True, 'Scikit-learn', 236889]]


## Counting People

In [3]:
python_users = 0
r_users = 0
sql_users = 0

for response in survey_responses:
    # The user columns are already booleans, so no comparison to True is needed
    if response[1]:
        python_users += 1
    if response[2]:
        r_users += 1
    if response[3]:
        sql_users += 1

proportion_python = (python_users / len(survey_responses)) * 100
proportion_r = (r_users / len(survey_responses)) * 100
proportion_sql = (sql_users / len(survey_responses)) * 100

print(f'Python Users: {python_users}({proportion_python:.2f}%), R Users: {r_users}({proportion_r:.2f}%), SQL Users: {sql_users}({proportion_sql:.2f}%)')


Python Users: 21860(84.16%), R Users: 5335(20.54%), SQL Users: 10757(41.42%)
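
The three counters follow the same pattern, so the count and percentage can be factored into one small function. This is a sketch under assumptions: `share_of_users` and `sample_responses` are hypothetical names, and the last two sample rows are invented for illustration rather than taken from the survey.

```python
# Four cleaned rows in the same layout as survey_responses.
sample_responses = [
    [6.1, True, False, True, 'Scikit-learn', 124267],
    [12.3, True, True, True, 'Scikit-learn', 236889],
    [1.0, False, True, False, None, 0],        # hypothetical row
    [3.5, True, False, False, None, 40000],    # hypothetical row
]

def share_of_users(responses, column_index):
    """Return (count, percentage) of rows whose boolean column is True."""
    count = sum(1 for response in responses if response[column_index])
    return count, count / len(responses) * 100

python_count, python_pct = share_of_users(sample_responses, 1)
r_count, r_pct = share_of_users(sample_responses, 2)
sql_count, sql_pct = share_of_users(sample_responses, 3)

print(f'Python Users: {python_count} ({python_pct:.2f}%)')
print(f'R Users: {r_count} ({r_pct:.2f}%)')
print(f'SQL Users: {sql_count} ({sql_pct:.2f}%)')
```

Because a respondent can use more than one language, the three percentages are independent shares and do not have to add up to 100.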


## Aggregating Information

In [4]:
years_of_experience = []
salary = []

for response in survey_responses:
    years_of_experience.append(response[0])
    salary.append(response[5])

minimum_years = min(years_of_experience)
maximum_years = max(years_of_experience)
average_years = sum(years_of_experience) / len(years_of_experience)
minimum_salary = min(salary)
maximum_salary = max(salary)
average_salary = sum(salary) / len(salary)

# years_of_experience
print(f'Minimum years of experience: {minimum_years}')
print(f'Maximum years of experience: {maximum_years}')
print(f'Average years of experience: {average_years:.2f}')

# salary
print(f'Minimum salary: ${minimum_salary}')
print(f'Maximum salary: ${maximum_salary}')
print(f'Average salary: ${average_salary:.0f}')


Minimum years of experience: 0.0
Maximum years of experience: 30.0
Average years of experience: 5.30
Minimum salary: $0
Maximum salary: $1492951
Average salary: $53253
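
With a minimum of $0 and a maximum of $1,492,951, a few extreme salaries can pull the mean away from what a typical respondent earns, so it may be worth printing the median next to it. The sketch below uses the standard library's `statistics` module on a tiny hypothetical list (`sample_salaries`); only the two extreme values and 124267 come from the outputs above, the rest are made up.

```python
import statistics

# Hypothetical sample: the extremes match the minimum and maximum printed
# above, the middle values are invented for illustration.
sample_salaries = [0, 45000, 53000, 124267, 1492951]

mean_salary = statistics.mean(sample_salaries)
median_salary = statistics.median(sample_salaries)

print(f'Mean salary: ${mean_salary:.0f}')
print(f'Median salary: ${median_salary:.0f}')
```

In this sample the single very large value pushes the mean far above the median. Whether the full dataset behaves the same way would have to be checked by running `statistics.median(salary)` on the real list.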


## Categorizing Years of Experience

In [5]:
# The bins are right-inclusive: 6.0 falls in 'less than 6 years',
# 6.1 in '7-12 years', 12.3 in '13-18 years', and so on.
for response in survey_responses:
    if response[0] <= 6:
        response.append('less than 6 years')
    elif response[0] <= 12:
        response.append('7-12 years')
    elif response[0] <= 18:
        response.append('13-18 years')
    elif response[0] <= 24:
        response.append('19-24 years')
    else:
        response.append('more than 24 years')

print(survey_responses[0:2])


[[6.1, True, False, True, 'Scikit-learn', 124267, '7-12 years'], [12.3, True, True, True, 'Scikit-learn', 236889, '13-18 years']]
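
The same binning can be written as a lookup, which keeps the boundaries and labels in one place. This is a minimal sketch, assuming the right-inclusive boundaries used in the cell above; `UPPER_BOUNDS`, `LABELS`, and `experience_category` are hypothetical names introduced for the example.

```python
from bisect import bisect_left

# Upper edge of every bin except the last one (right-inclusive).
UPPER_BOUNDS = [6, 12, 18, 24]
LABELS = [
    'less than 6 years',
    '7-12 years',
    '13-18 years',
    '19-24 years',
    'more than 24 years',
]

def experience_category(years):
    """Return the category label for a number of years of experience."""
    # bisect_left returns the index of the first bound that is >= years,
    # so a value equal to a bound stays in the lower bin.
    return LABELS[bisect_left(UPPER_BOUNDS, years)]

for years in (6.0, 6.1, 12.3, 30.0):
    print(years, experience_category(years))
```

Changing a boundary or adding a bin then only requires editing the two lists, not the chain of conditions.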


## Distribution of Experience and Compensation

In [8]:
less_six = []
seven_twelve = []
thirteen_eighteen = []
nineteen_twentyfour = []
more_twentyfour = []

for response in survey_responses:
    # Each response has exactly one category, so elif avoids needless checks
    if response[6] == 'less than 6 years':
        less_six.append(response[5])
    elif response[6] == '7-12 years':
        seven_twelve.append(response[5])
    elif response[6] == '13-18 years':
        thirteen_eighteen.append(response[5])
    elif response[6] == '19-24 years':
        nineteen_twentyfour.append(response[5])
    elif response[6] == 'more than 24 years':
        more_twentyfour.append(response[5])

len1 = len(less_six)
len2 = len(seven_twelve)
len3 = len(thirteen_eighteen)
len4 = len(nineteen_twentyfour)
len5 = len(more_twentyfour)

print(f'{len1} data professionals have less than 6 years of experience.')
print(f'{len2} data professionals have 7 to 12 years of experience.')
print(f'{len3} data professionals have 13 to 18 years of experience.')
print(f'{len4} data professionals have 19 to 24 years of experience.')
print(f'{len5} data professionals have more than 24 years of experience.')

average1 = sum(less_six) / len1
average2 = sum(seven_twelve) / len2
average3 = sum(thirteen_eighteen) / len3
average4 = sum(nineteen_twentyfour) / len4
average5 = sum(more_twentyfour) / len5

print()
print(f'Data professionals with less than 6 years of experience have an average salary of ${average1:.2f}.')
print(f'Data professionals with 7 - 12 years of experience have an average salary of ${average2:.2f}.')
print(f'Data professionals with 13 - 18 years of experience have an average salary of ${average3:.2f}.')
print(f'Data professionals with 19 - 24 years of experience have an average salary of ${average4:.2f}.')
print(f'Data professionals with more than 24 years of experience have an average salary of ${average5:.2f}.')


19508 data professionals have less than 6 years of experience.
2899 data professionals have 7 to 12 years of experience.
1284 data professionals have 13 to 18 years of experience.
1191 data professionals have 19 to 24 years of experience.
1091 data professionals have more than 24 years of experience.

Data professionals with less than 6 years of experience have an average salary of $45429.14.
Data professionals with 7 - 12 years of experience have an average salary of $64173.32.
Data professionals with 13 - 18 years of experience have an average salary of $76799.69.
Data professionals with 19 - 24 years of experience have an average salary of $91421.98.
Data professionals with more than 24 years of experience have an average salary of $94748.73.
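
The five separate lists can also be replaced by a single dictionary keyed on the category label. The sketch below shows one way to do that with `collections.defaultdict`; `sample_responses` is a hypothetical stand-in for `survey_responses`, and its last two rows are invented for illustration.

```python
from collections import defaultdict

# Cleaned rows with the category label appended at index 6.
sample_responses = [
    [6.1, True, False, True, 'Scikit-learn', 124267, '7-12 years'],
    [12.3, True, True, True, 'Scikit-learn', 236889, '13-18 years'],
    [2.0, True, False, False, None, 30000, 'less than 6 years'],   # hypothetical row
    [4.0, False, True, False, None, 50000, 'less than 6 years'],   # hypothetical row
]

# Group the salaries (index 5) by category label (index 6).
salaries_by_category = defaultdict(list)
for response in sample_responses:
    salaries_by_category[response[6]].append(response[5])

# Only categories that actually occur become keys, so no division by zero.
average_by_category = {
    category: sum(values) / len(values)
    for category, values in salaries_by_category.items()
}

for category, values in salaries_by_category.items():
    print(f'{category}: {len(values)} respondents, '
          f'average salary ${average_by_category[category]:.2f}')
```

A category with no respondents simply does not appear in the dictionary, whereas the list-based version would raise `ZeroDivisionError` when dividing by a length of 0.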


## Summary of Findings

1. Respondents in the lowest experience bracket (19,508 of 25,973, about 75%) outnumber all the other brackets combined (6,465). Most of the data professionals who took the survey are therefore relatively new to the field.
2. Average salary rises with every experience bracket, from $45,429.14 in the lowest bracket to $94,748.73 for respondents with more than 24 years of experience. The increase is not even, though: the largest step is between the first two brackets (about $18,700) and the smallest is between the last two (about $3,300).
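
The change in average salary from one bracket to the next can be checked with a few lines of arithmetic on the averages printed in the previous section. The variable names below are hypothetical; the numbers are copied from that output.

```python
# Average salary per bracket, copied from the printed output.
average_salaries = [
    ('less than 6 years', 45429.14),
    ('7-12 years', 64173.32),
    ('13-18 years', 76799.69),
    ('19-24 years', 91421.98),
    ('more than 24 years', 94748.73),
]

# Difference between each bracket and the one before it.
steps = []
for (prev_label, prev_avg), (label, avg) in zip(average_salaries, average_salaries[1:]):
    step = round(avg - prev_avg, 2)
    steps.append(step)
    print(f'{prev_label} -> {label}: +${step:.2f}')
```

Every step is positive, so average salary increases with experience in this dataset, but the size of the step varies from bracket to bracket.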