## Dataframe

Pandas is built on top of two important libraries: **NumPy**, which provides multidimensional array objects for easy data manipulation, and **matplotlib**, which provides powerful data visualization capabilities.

### Methods
1. `.head()` returns the first 5 rows of a DataFrame.
2. `.info()` shows the column names, their data types, and whether they contain null values.
3. `.describe()` computes basic summary statistics for the numeric columns: count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.

### Attributes
1. `.shape` returns a tuple containing the number of rows followed by the number of columns.
2. `.values` contains the data values as a two-dimensional NumPy array.
3. `.columns` contains the column names.
4. `.index` contains the row numbers or row names.
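
As a minimal sketch of the methods and attributes listed above, the following cell builds a small hypothetical DataFrame named `toy` (the values are invented for illustration and are not the homelessness data) and inspects it.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "state": ["A", "B", "C"],
        "individuals": [10.0, 20.0, 30.0],
        "state_pop": [1000, 2000, 3000],
    }
)

# Methods
first_rows = toy.head(2)    # first 2 rows (the default is 5)
toy.info()                  # prints column names, dtypes and non-null counts
summary = toy.describe()    # summary statistics for the numeric columns

# Attributes (no parentheses)
n_rows, n_cols = toy.shape          # number of rows, then number of columns
values = toy.values                 # two-dimensional NumPy array of the data
column_names = list(toy.columns)    # column labels
row_labels = list(toy.index)        # row labels, here 0, 1, 2
```

Methods are called with parentheses, while attributes are read without them.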

In [2]:
import pandas as pd

In [14]:
# Read a .csv file into a DataFrame
homelessness = pd.read_csv('dataset/homelessness.csv')

The `homelessness` DataFrame contains estimates of homelessness in each US state in 2018. The `individuals` column is the number of homeless people who are not part of a family with children. The `family_members` column is the number of homeless people who are part of a family with children. The `state_pop` column is the total population of the state.

In [15]:
# Return the first rows of the DataFrame
homelessness.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
0,0,East South Central,Alabama,2570.0,864.0,4887681
1,1,Pacific,Alaska,1434.0,582.0,735139
2,2,Mountain,Arizona,7259.0,2606.0,7158024
3,3,West South Central,Arkansas,2280.0,432.0,3009733
4,4,Pacific,California,109008.0,20964.0,39461588


In [16]:
# Show summary information about the DataFrame
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      51 non-null     int64  
 1   region          51 non-null     object 
 2   state           51 non-null     object 
 3   individuals     51 non-null     float64
 4   family_members  51 non-null     float64
 5   state_pop       51 non-null     int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 2.5+ KB


In [17]:
# Get the number of rows and columns
homelessness.shape

(51, 6)

In [18]:
# Get summary statistics for the DataFrame
homelessness.describe()

Unnamed: 0.1,Unnamed: 0,individuals,family_members,state_pop
count,51.0,51.0,51.0,51.0
mean,25.0,7225.784314,3504.882353,6405637.0
std,14.866069,15991.025083,7805.411811,7327258.0
min,0.0,434.0,75.0,577601.0
25%,12.5,1446.5,592.0,1777414.0
50%,25.0,3082.0,1482.0,4461153.0
75%,37.5,6781.5,3196.0,7340946.0
max,50.0,109008.0,52070.0,39461590.0


In [None]:
# Get the two-dimensional NumPy array of values
homelessness.values

array([[0, 'East South Central', 'Alabama', 2570.0, 864.0, 4887681],
       [1, 'Pacific', 'Alaska', 1434.0, 582.0, 735139],
       [2, 'Mountain', 'Arizona', 7259.0, 2606.0, 7158024],
       [3, 'West South Central', 'Arkansas', 2280.0, 432.0, 3009733],
       [4, 'Pacific', 'California', 109008.0, 20964.0, 39461588],
       [5, 'Mountain', 'Colorado', 7607.0, 3250.0, 5691287],
       [6, 'New England', 'Connecticut', 2280.0, 1696.0, 3571520],
       [7, 'South Atlantic', 'Delaware', 708.0, 374.0, 965479],
       [8, 'South Atlantic', 'District of Columbia', 3770.0, 3134.0,
        701547],
       [9, 'South Atlantic', 'Florida', 21443.0, 9587.0, 21244317],
       [10, 'South Atlantic', 'Georgia', 6943.0, 2556.0, 10511131],
       [11, 'Pacific', 'Hawaii', 4131.0, 2399.0, 1420593],
       [12, 'Mountain', 'Idaho', 1297.0, 715.0, 1750536],
       [13, 'East North Central', 'Illinois', 6752.0, 3891.0, 12723071],
       [14, 'East North Central', 'Indiana', 3776.0, 1482.0, 6695497],
    

In [20]:
# Get the column names of the DataFrame
homelessness.columns

Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members',
       'state_pop'],
      dtype='object')

In [21]:
# Get the index of the DataFrame
homelessness.index

RangeIndex(start=0, stop=51, step=1)

# Sorting and subsetting

1. Sorting: sort by one or more columns with `.sort_values(column_name, ascending=True)`. To sort by several columns, pass a list of column names and, optionally, a matching list for `ascending`.
2. Subsetting columns: `dataframe[column_name]` selects a single column; to select several, pass a list: `dataframe[[col_name1, col_name2]]`.
3. Subsetting rows: use logical conditions.
    - `dataframe[column_name] > 50` produces a boolean Series.
    - `dataframe[dataframe[column_name] > 50]` keeps only the rows where the condition is true.
    - `dataframe[(dataframe[column_name] == string) & (dataframe[other_column_name] == date)]` combines two conditions; each one must be wrapped in parentheses.
4. Filtering on several values of a categorical variable: the simplest way is the `.isin()` method.
    - `dataframe[dataframe[column_name].isin([value1, value2])]`
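
The four patterns above can be sketched on a small hypothetical DataFrame (`toy`, with invented regions, states, and counts) so that the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "region": ["West", "East", "West", "East"],
        "state": ["A", "B", "C", "D"],
        "individuals": [40.0, 10.0, 60.0, 30.0],
    }
)

# 1. Sorting: one column descending, then two columns with mixed order
by_ind = toy.sort_values("individuals", ascending=False)
by_region_ind = toy.sort_values(["region", "individuals"], ascending=[True, False])

# 2. Subsetting columns: a single label gives a Series, a list gives a DataFrame
one_col = toy["individuals"]
two_cols = toy[["state", "individuals"]]

# 3. Subsetting rows: build a boolean mask, then index with it
mask = toy["individuals"] > 25
gt_25 = toy[mask]
west_gt_50 = toy[(toy["region"] == "West") & (toy["individuals"] > 50)]

# 4. Filtering on several categorical values with .isin()
a_or_b = toy[toy["state"].isin(["A", "B"])]
```

Note that every result keeps the original row labels, which is why sorted or filtered outputs show a non-consecutive index.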

In [24]:
# Sort the DataFrame by the number of homeless people in the `individuals` column, from smallest to largest
homelessness_ind = homelessness.sort_values('individuals')
homelessness_ind.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
50,50,Mountain,Wyoming,434.0,205.0,577601
34,34,West North Central,North Dakota,467.0,75.0,758080
7,7,South Atlantic,Delaware,708.0,374.0,965479
39,39,New England,Rhode Island,747.0,354.0,1058287
45,45,New England,Vermont,780.0,511.0,624358


In [25]:
# Sort the DataFrame by the number of homeless `family_members` in descending order
homelessness_fam = homelessness.sort_values('family_members', ascending=False)
homelessness_fam.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
4,4,Pacific,California,109008.0,20964.0,39461588
21,21,New England,Massachusetts,6811.0,13257.0,6882635
9,9,South Atlantic,Florida,21443.0,9587.0,21244317
43,43,West South Central,Texas,19199.0,6111.0,28628666


In [28]:
# Sort the DataFrame first by region (ascending), then by number of family members (descending)
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])
homelessness_reg_fam.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
13,13,East North Central,Illinois,6752.0,3891.0,12723071
35,35,East North Central,Ohio,6929.0,3320.0,11676341
22,22,East North Central,Michigan,5209.0,3142.0,9984072
49,49,East North Central,Wisconsin,2740.0,2167.0,5807406
14,14,East North Central,Indiana,3776.0,1482.0,6695497


In [29]:
# Create a Series called individuals that contains only the individuals column
individuals = homelessness['individuals']
individuals.head()

0      2570.0
1      1434.0
2      7259.0
3      2280.0
4    109008.0
Name: individuals, dtype: float64

In [31]:
# Create a DataFrame called state_fam that contains only the state and family_members columns
state_fam = homelessness[['state', 'family_members']]
state_fam.head()

Unnamed: 0,state,family_members
0,Alabama,864.0
1,Alaska,582.0
2,Arizona,2606.0
3,Arkansas,432.0
4,California,20964.0


In [32]:
# Create a DataFrame called ind_state that contains the individuals and state columns, in that order
ind_state = homelessness[['individuals', 'state']]
ind_state.head()

Unnamed: 0,individuals,state
0,2570.0,Alabama
1,1434.0,Alaska
2,7259.0,Arizona
3,2280.0,Arkansas
4,109008.0,California


In [33]:
# Filter the DataFrame for rows where the number of individuals is greater than ten thousand
ind_gt_10k = homelessness[homelessness['individuals'] > 10_000]
ind_gt_10k

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
4,4,Pacific,California,109008.0,20964.0,39461588
9,9,South Atlantic,Florida,21443.0,9587.0,21244317
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
37,37,Pacific,Oregon,11139.0,3337.0,4181886
43,43,West South Central,Texas,19199.0,6111.0,28628666
47,47,Pacific,Washington,16424.0,5880.0,7523869


In [35]:
# Filter the DataFrame for rows where the US census region is "Mountain"
mountain_reg = homelessness[homelessness['region'].isin(['Mountain'])]
mountain_reg

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
2,2,Mountain,Arizona,7259.0,2606.0,7158024
5,5,Mountain,Colorado,7607.0,3250.0,5691287
12,12,Mountain,Idaho,1297.0,715.0,1750536
26,26,Mountain,Montana,983.0,422.0,1060665
28,28,Mountain,Nevada,7058.0,486.0,3027341
31,31,Mountain,New Mexico,1949.0,602.0,2092741
44,44,Mountain,Utah,1904.0,972.0,3153550
50,50,Mountain,Wyoming,434.0,205.0,577601


In [36]:
# Filter the DataFrame for rows where family_members is less than one thousand and region is "Pacific"
fam_lt_1k_pac = homelessness[(homelessness['family_members'] < 1_000) & (homelessness['region'] == "Pacific")]
fam_lt_1k_pac

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
1,1,Pacific,Alaska,1434.0,582.0,735139


In [37]:
# Filter the DataFrame for rows where state is in the list of Mojave Desert states, canu
canu = ['California', 'Arizona', 'Nevada', 'Utah']
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]
mojave_homelessness

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
2,2,Mountain,Arizona,7259.0,2606.0,7158024
4,4,Pacific,California,109008.0,20964.0,39461588
28,28,Mountain,Nevada,7058.0,486.0,3027341
44,44,Mountain,Utah,1904.0,972.0,3153550


# New columns

To create a new column in a DataFrame, assign to a new column name, for example `dataframe[new_column] = dataframe[column_name] / 100`.
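
A minimal sketch of this assignment pattern, using a hypothetical two-row DataFrame (`toy`, with invented values) and a hypothetical derived column `total_per_10k`:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "state": ["A", "B"],
        "individuals": [30.0, 10.0],
        "family_members": [20.0, 10.0],
        "state_pop": [10_000, 40_000],
    }
)

# Arithmetic between columns is applied row by row
toy["total"] = toy["individuals"] + toy["family_members"]

# A new column can be used immediately to derive another one
toy["total_per_10k"] = 10_000 * toy["total"] / toy["state_pop"]
```

The assignment modifies `toy` in place, so later cells see the added columns.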

In [39]:
# Add a new column to the DataFrame, named total, containing the sum of the individuals and family_members columns
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']
homelessness.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop,total
0,0,East South Central,Alabama,2570.0,864.0,4887681,3434.0
1,1,Pacific,Alaska,1434.0,582.0,735139,2016.0
2,2,Mountain,Arizona,7259.0,2606.0,7158024,9865.0
3,3,West South Central,Arkansas,2280.0,432.0,3009733,2712.0
4,4,Pacific,California,109008.0,20964.0,39461588,129972.0


In [41]:
# Add another column, named p_homeless, containing the proportion of the total homeless population to the total population of each state, state_pop
homelessness['p_homeless'] = homelessness['total'] / homelessness['state_pop']
homelessness.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_homeless
0,0,East South Central,Alabama,2570.0,864.0,4887681,3434.0,0.000703
1,1,Pacific,Alaska,1434.0,582.0,735139,2016.0,0.002742
2,2,Mountain,Arizona,7259.0,2606.0,7158024,9865.0,0.001378
3,3,West South Central,Arkansas,2280.0,432.0,3009733,2712.0,0.000901
4,4,Pacific,California,109008.0,20964.0,39461588,129972.0,0.003294


In [43]:
# Add a column, indiv_per_10k, containing the number of homeless individuals per ten thousand people in each state, using state_pop as the state population
homelessness['indiv_per_10k'] = 10_000 * homelessness['individuals'] / homelessness['state_pop']
homelessness.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_homeless,indiv_per_10k
0,0,East South Central,Alabama,2570.0,864.0,4887681,3434.0,0.000703,5.258117
1,1,Pacific,Alaska,1434.0,582.0,735139,2016.0,0.002742,19.506515
2,2,Mountain,Arizona,7259.0,2606.0,7158024,9865.0,0.001378,10.141067
3,3,West South Central,Arkansas,2280.0,432.0,3009733,2712.0,0.000901,7.575423
4,4,Pacific,California,109008.0,20964.0,39461588,129972.0,0.003294,27.623825


In [44]:
# Subset the rows where indiv_per_10k is greater than 20
high_homelessness = homelessness[homelessness['indiv_per_10k'] > 20]
high_homelessness

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_homeless,indiv_per_10k
4,4,Pacific,California,109008.0,20964.0,39461588,129972.0,0.003294,27.623825
8,8,South Atlantic,District of Columbia,3770.0,3134.0,701547,6904.0,0.009841,53.738381
11,11,Pacific,Hawaii,4131.0,2399.0,1420593,6530.0,0.004597,29.079406
28,28,Mountain,Nevada,7058.0,486.0,3027341,7544.0,0.002492,23.314189
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351,91897.0,0.004705,20.392363
37,37,Pacific,Oregon,11139.0,3337.0,4181886,14476.0,0.003462,26.636307
47,47,Pacific,Washington,16424.0,5880.0,7523869,22304.0,0.002964,21.829195


In [45]:
# Sort high_homelessness by indiv_per_10k in descending order
high_homelessness_srt = high_homelessness.sort_values('indiv_per_10k', ascending=False)
high_homelessness_srt

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_homeless,indiv_per_10k
8,8,South Atlantic,District of Columbia,3770.0,3134.0,701547,6904.0,0.009841,53.738381
11,11,Pacific,Hawaii,4131.0,2399.0,1420593,6530.0,0.004597,29.079406
4,4,Pacific,California,109008.0,20964.0,39461588,129972.0,0.003294,27.623825
37,37,Pacific,Oregon,11139.0,3337.0,4181886,14476.0,0.003462,26.636307
28,28,Mountain,Nevada,7058.0,486.0,3027341,7544.0,0.002492,23.314189
47,47,Pacific,Washington,16424.0,5880.0,7523869,22304.0,0.002964,21.829195
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351,91897.0,0.004705,20.392363


In [46]:
# Select only the state and indiv_per_10k columns of high_homelessness_srt and save them as result
result = high_homelessness_srt[['state', 'indiv_per_10k']]
result

Unnamed: 0,state,indiv_per_10k
8,District of Columbia,53.738381
11,Hawaii,29.079406
4,California,27.623825
37,Oregon,26.636307
28,Nevada,23.314189
47,Washington,21.829195
32,New York,20.392363
