# DATA ACQUISITION
**IMPORT LIBRARIES AND LOAD THE DATASET**

First, we import the libraries needed for the analysis, then read and display the dataset.

In [1]:
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from IPython.display import display

pd.set_option('display.max_columns', None)

In [None]:
url = 'https://raw.githubusercontent.com/simonzanetti/2023.2-SysArmy-IT-Salaries-Survey/main/dataset.csv'
df = pd.read_csv(url)

display(df.head(1))

#print("Shape: " + str(df.shape))
#print("\nColumn Names and Types: ")
#print(df.dtypes)

# PREPROCESSING
**REORDER COLUMNS**

The first cleaning step is to translate the column names into English and drop the columns that are not relevant to the analysis. I also grouped the columns into the following categories:
- Employee
- Job
- Company
- Salary
- Tools
- Education

The dropped columns include:
- 'work_country': All respondents are from Argentina.
- 'ARS/USD_exchange': Only a small share of respondents are paid partly or entirely in dollars. In addition, several exchange rates exist and the survey does not state which one applies in each case, and the rate can change from hour to hour, day to day, or week to week. The variable is therefore unstable, and I decided to drop it.
- 'is_number': A variable we will not use.
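Before dropping a column such as 'work_country' for carrying a single value, it can be worth confirming that claim programmatically. The following is a minimal sketch on a toy DataFrame; the helper name `constant_columns` and the toy rows are hypothetical and are not part of the survey data.

```python
import pandas as pd

def constant_columns(frame):
    """Return the names of columns that contain at most one distinct value."""
    # nunique(dropna=False) counts NaN as a value, so a column mixing
    # one label with missing entries is not reported as constant.
    counts = frame.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()

# Toy data for illustration only (not taken from the survey).
toy = pd.DataFrame({
    'work_country': ['Argentina', 'Argentina', 'Argentina'],
    'work_province': ['Cordoba', 'Santa Fe', 'Cordoba'],
})

to_drop = constant_columns(toy)
print(to_drop)
```

On the real dataset the same helper could be called on `df` after renaming, assuming the column names match those used in this notebook.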

In [4]:
columns_names = ['work_country','work_province','work_dedication','work_contract_type',
                'last_month_gross_salary','last_month_net_salary','salary_in_usd',
                'ARS/USD_exchange','salary_has_bonus','salary_bonus_tied_to',
                '2023_salary_adjusment_times','percentage_adjustment',
                'last_adjustment_month','last_semester_salary_comparison',
                'work_benefits','salary_satisfaction','work_title','years_experience',
                'years_in_company','years_in_position','people_in_charge','platforms',
                'programming_languages','frameworks','databases','qa_testing_tools',
                'company_size','work_mode','office_days_number(hybrid)','work_place_satisfaction',
                'AI_tools_use','finish_survey(1)','highest_level_studies','status','degree/specialization',
                'university/school','finish_survey(2)','work_on-call_duty','salary_on-call_duty',
                'is_number','finish_survey(3)','age','gender']

# Rename by position: the i-th original column receives the i-th English name.
df = df.rename(columns=dict(zip(df.columns, columns_names)))

salaries = pd.concat([df.iloc[:,0:2],df.iloc[:,41:],df.iloc[:,16],
                      df.iloc[:,2:4],df.iloc[:,17:21],df.iloc[:,26:30],
                      df.iloc[:,4:16],df.iloc[:,21:26],df.iloc[:,30:41]],axis=1)

salaries.drop(inplace=True, columns=[
                                    'work_country','ARS/USD_exchange','2023_salary_adjusment_times',
                                    'percentage_adjustment','last_adjustment_month',
                                    'last_semester_salary_comparison','finish_survey(1)','finish_survey(2)',
                                    'work_on-call_duty','salary_on-call_duty','is_number','finish_survey(3)'
                                    ])

display(salaries.head(0))

Unnamed: 0,work_province,age,gender,work_title,work_dedication,work_contract_type,years_experience,years_in_company,years_in_position,people_in_charge,company_size,work_mode,office_days_number(hybrid),work_place_satisfaction,last_month_gross_salary,last_month_net_salary,salary_in_usd,salary_has_bonus,salary_bonus_tied_to,work_benefits,salary_satisfaction,platforms,programming_languages,frameworks,databases,qa_testing_tools,AI_tools_use,highest_level_studies,status,degree/specialization,university/school
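The reordering above relies on positional slices with `iloc`, which silently breaks if the survey adds or moves a column. As an alternative sketch, the same grouping can be expressed by column name. The `column_groups` dictionary below is hypothetical and lists only a few of the renamed columns; the toy values are placeholders.

```python
import pandas as pd

# Hypothetical grouping by category; only a subset of columns is shown.
column_groups = {
    'employee': ['work_province', 'age', 'gender'],
    'job': ['work_title', 'work_dedication'],
    'salary': ['last_month_gross_salary'],
}

# Toy frame whose columns arrive in an arbitrary order (placeholder values).
toy = pd.DataFrame({
    'work_province': ['Cordoba'],
    'work_dedication': ['Full-Time'],
    'last_month_gross_salary': [1000.0],
    'work_title': ['Developer'],
    'age': [30],
    'gender': ['X'],
})

# Flatten the groups in order and select the columns by name.
ordered = [name for group in column_groups.values() for name in group]
reordered = toy[ordered]
print(list(reordered.columns))
```

Selecting by name raises a `KeyError` when a listed column is missing, which makes a changed survey layout visible immediately.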


**CLEANING AND TRANSFORMING NaNs**

In [6]:
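print(salaries.isna().sum()) # Re-run the cell to see the changes

#SALARY_IN_USD
#print(salaries['salary_in_usd'].unique())
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and which may not modify the DataFrame.
salaries['salary_in_usd'] = salaries['salary_in_usd'].fillna('No cobro mi salario en dolares')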

#TOOLS
salaries.loc[:, 'platforms':'AI_tools_use'] = salaries.loc[:, 'platforms':'AI_tools_use'].fillna('Ninguna de las anteriores')

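#STUDIES
#print(df['finish_survey(1)'].unique())
#print(df['highest_level_studies'].unique())
# Rows where 'finish_survey(1)' is 'Terminar encuesta' or 'Responder sobre guardias'
# get the sentinel value in every education column.
skipped_studies = df['finish_survey(1)'].isin(['Terminar encuesta', 'Responder sobre guardias'])
for column in ('highest_level_studies','status','degree/specialization','university/school'):
    salaries.loc[skipped_studies, column] = 'Prefiero no responder'

# Rows whose highest level of studies is 'Secundario' get the sentinel value
# for the degree and university columns.
secondary_only = df['highest_level_studies'] == 'Secundario'
for column in ('degree/specialization','university/school'):
    salaries.loc[secondary_only, column] = 'Prefiero no responder'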

work_province                   0
age                             0
gender                          0
work_title                      0
work_dedication                 0
work_contract_type              0
years_experience                0
years_in_company                0
years_in_position               0
people_in_charge                0
company_size                    0
work_mode                       0
office_days_number(hybrid)      0
work_place_satisfaction         0
last_month_gross_salary         0
last_month_net_salary         239
salary_in_usd                   0
salary_has_bonus                0
salary_bonus_tied_to            0
work_benefits                   0
salary_satisfaction             0
platforms                       0
programming_languages           0
frameworks                      0
databases                       0
qa_testing_tools                0
AI_tools_use                    0
highest_level_studies           0
status                          0
degree/special
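
The two filling patterns used in the cell above, an unconditional `fillna` and a mask-based assignment with `loc`, can be seen in isolation on a toy frame. This is a minimal sketch: the rows are invented for illustration, and values such as 'Universitario' and 'UBA' are placeholders rather than confirmed survey answers.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    'highest_level_studies': ['Secundario', 'Universitario', np.nan],
    'university/school': [np.nan, 'UBA', np.nan],
    'salary_in_usd': [np.nan, 'placeholder answer', np.nan],
})

# Unconditional fill: every missing value receives the same sentinel.
toy['salary_in_usd'] = toy['salary_in_usd'].fillna('No cobro mi salario en dolares')

# Conditional fill: only rows matching the mask are overwritten.
mask = toy['highest_level_studies'] == 'Secundario'
toy.loc[mask, 'university/school'] = 'Prefiero no responder'

# Count what is still missing after both steps.
remaining = toy.isna().sum()
print(remaining)
```

Rows that match no mask keep their NaN, so a final `isna().sum()` is a convenient way to see which columns still need a decision.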

**REMOVING DUPLICATES**
