# Pandas

Pandas is a data analysis library built on top of NumPy. It is the main tool in the Python ecosystem for analyzing, cleaning, and preparing data. It is fast and productive to work with, and it can read data from a variety of sources.

Pandas is installed with *conda install pandas* or *pip install pandas*.

In [18]:
import numpy as np
import pandas as pd

In [19]:
labels = ('vardas', 'pavarde', 'amzius')
zmogus = ('Geras', 'Zmogelis', 33)
print(labels, zmogus)

('vardas', 'pavarde', 'amzius') ('Geras', 'Zmogelis', 33)


In [20]:
pd.Series(data=zmogus)

0       Geras
1    Zmogelis
2          33
dtype: object

In [21]:
zmogus_dict = {
    'vardas': 'Smagus',
    'pavarde': 'Giedriukas',
    'amzius': 36
}

In [22]:
zmogus_pd = pd.Series(zmogus_dict)

In [23]:
pd.Series(zmogus, labels)

vardas        Geras
pavarde    Zmogelis
amzius           33
dtype: object

In [24]:
zmogus_pd['amzius']

36

In [25]:
miestai = pd.Series((500, 400, 300, 200), ('Vilnius', 'Kaunas', 'Klaipėda', 'Šiauliai'))

In [26]:
augimas = pd.Series((1.5, 1.3, 0.9, 0.7), ('Vilnius', 'Kaunas', 'Klaipėda', 'Šiauliai'))

In [27]:
miestai * augimas

Vilnius     750.0
Kaunas      520.0
Klaipėda    270.0
Šiauliai    140.0
dtype: float64

# Series

A Series is the basic one-dimensional pandas data structure, built on top of the NumPy array.

In [28]:
labels = ['x', 'y', 'z']
data = [20, 30, 40]
pd.Series(data=data)

0    20
1    30
2    40
dtype: int64

As the output shows, a pandas Series differs from an ordinary array in that it has an index. One of the parameters we can pass when creating a Series is *index*.

In [29]:
pd.Series(data=data, index=labels)

x    20
y    30
z    40
dtype: int64

A pandas Series can also be created from a Python dictionary:

In [30]:
zodynas = {'x':20, 'y':30, 'z':40}
pd.Series(zodynas)

x    20
y    30
z    40
dtype: int64

**Extracting a value from a Series**

In [31]:
serija = pd.Series([1,2,3,4,5], ['Vilnius', 'Kaunas', 'Klaipėda', 'Panevėžys', 'Šiauliai'])
# note that the data and the index can be passed positionally, without naming the parameters.

In [32]:
serija


Vilnius      1
Kaunas       2
Klaipėda     3
Panevėžys    4
Šiauliai     5
dtype: int64

In [33]:
serija['Vilnius']

1

**Operations on Series**

In [34]:
serija2 = pd.Series([1,2,3,4,5], ['Vilnius', 'Kaunas', 'Lentvaris', 'Šiauliai', 'Klaipėda'])

In [35]:
serija2

Vilnius      1
Kaunas       2
Lentvaris    3
Šiauliai     4
Klaipėda     5
dtype: int64

When two Series are added, pandas aligns them by index label and sums the values wherever both labels exist:

In [36]:
serija + serija2

Kaunas       4.0
Klaipėda     8.0
Lentvaris    NaN
Panevėžys    NaN
Vilnius      2.0
Šiauliai     9.0
dtype: float64

Where pandas could not perform the addition because a label exists in only one of the Series, it produced NaN (*not a number*). The result is float64 rather than int64 because NaN is a floating-point value, so the integer values are converted to float to share a column with it.
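If a label missing from one side should count as zero instead of producing NaN, one option is `Series.add` with its `fill_value` parameter. The sketch below uses two small hypothetical Series (`s1`, `s2`) that are not part of the notebook above:

```python
import pandas as pd

# Hypothetical data for illustration only
s1 = pd.Series([1, 2, 3], index=['Vilnius', 'Kaunas', 'Klaipėda'])
s2 = pd.Series([10, 20, 30], index=['Vilnius', 'Kaunas', 'Lentvaris'])

# The + operator aligns on labels; a label present on one side only gives NaN
plain_sum = s1 + s2

# add() with fill_value=0 treats a label missing from one side as 0
filled_sum = s1.add(s2, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Whether zero is a sensible substitute depends on the data; for population counts it would usually hide a genuine gap.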

# DataFrames

The DataFrame is the main object that pandas operations work on. To create a new DataFrame, pass the *data*, *index*, and *columns* parameters:

In [37]:
betko7x7 = np.random.randint(10, 100, (7, 7))

In [38]:
df = pd.DataFrame(np.random.rand(5,6), 
                  ['a', 'b', 'c', 'd', 'e'], 
                  ['U', 'V', 'W', 'X', 'Y', 'Z'])

In [39]:
df      

          U         V         W         X         Y         Z
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455
b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230
d  0.492632  0.405244  0.093171  0.502899  0.473212  0.618707
e  0.765506  0.752538  0.280722  0.570152  0.128662  0.903604


Each column is a pandas Series, and all columns share the same index (a, b, c, d, e), for example:

In [40]:
df['U']

a    0.863571
b    0.932546
c    0.543049
d    0.492632
e    0.765506
Name: U, dtype: float64

In [41]:
type(df['U'])

pandas.core.series.Series
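The claim that every column is a Series sharing the DataFrame's index can be checked directly. This is a minimal sketch on a small hypothetical frame (`df_demo`), not the `df` used above:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame for illustration only
df_demo = pd.DataFrame(np.random.rand(3, 2),
                       index=['a', 'b', 'c'],
                       columns=['U', 'V'])

col_u = df_demo['U']

print(type(col_u))                        # the column is a Series
print(col_u.index.equals(df_demo.index))  # it carries the same row labels
print(col_u.name)                         # its name is the column label
```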

**Selecting several columns:**

In [42]:
df[['U', 'Y', 'Z']]

          U         Y         Z
a  0.863571  0.751997  0.922455
b  0.932546  0.715740  0.373847
c  0.543049  0.622360  0.111230
d  0.492632  0.473212  0.618707
e  0.765506  0.128662  0.903604


**Creating a new column**

In [43]:
df['naujas'] = [1, 2, 3, 4, 5]

In [44]:
df

          U         V         W         X         Y         Z  naujas
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455       1
b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847       2
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230       3
d  0.492632  0.405244  0.093171  0.502899  0.473212  0.618707       4
e  0.765506  0.752538  0.280722  0.570152  0.128662  0.903604       5


**Deleting a column**

In [45]:
df.drop('naujas', axis=1)

          U         V         W         X         Y         Z
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455
b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230
d  0.492632  0.405244  0.093171  0.502899  0.473212  0.618707
e  0.765506  0.752538  0.280722  0.570152  0.128662  0.903604


axis=0 would mean that the operation applies to a row, while axis=1 means a column.
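To make the two axis values concrete, the sketch below drops one row and one column from a small hypothetical frame (`df_demo`). It also shows the `index=` and `columns=` keywords of `drop`, which express the same thing without an axis number:

```python
import pandas as pd

# Hypothetical frame for illustration only
df_demo = pd.DataFrame({'U': [1, 2, 3], 'V': [4, 5, 6]},
                       index=['a', 'b', 'c'])

without_row = df_demo.drop('a', axis=0)  # axis=0: 'a' is looked up among row labels
without_col = df_demo.drop('V', axis=1)  # axis=1: 'V' is looked up among column labels

# Equivalent spellings that name the axis instead of numbering it
same_row = df_demo.drop(index='a')
same_col = df_demo.drop(columns='V')

print(without_row)
print(without_col)
```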

**The inplace parameter**

Our last operation did not change the original object. If we display df now, we see that it is exactly as it was:

In [46]:
df

          U         V         W         X         Y         Z  naujas
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455       1
b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847       2
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230       3
d  0.492632  0.405244  0.093171  0.502899  0.473212  0.618707       4
e  0.765506  0.752538  0.280722  0.570152  0.128662  0.903604       5


To change the original, we have to pass the parameter inplace=True:

In [47]:
df.drop('naujas', axis=1, inplace=True)

In [48]:
df

          U         V         W         X         Y         Z
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455
b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230
d  0.492632  0.405244  0.093171  0.502899  0.473212  0.618707
e  0.765506  0.752538  0.280722  0.570152  0.128662  0.903604


Because *inplace* defaults to False, we are protected from accidentally damaging our data: the original changes only when we ask for it explicitly.
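An alternative to inplace=True is to assign the returned DataFrame back to the same name. The sketch below contrasts the two on hypothetical frames (`df_a`, `df_b`); note that a call with inplace=True returns None, so its result should not be assigned:

```python
import pandas as pd

# Two identical hypothetical frames for illustration only
df_a = pd.DataFrame({'U': [1, 2], 'naujas': [3, 4]})
df_b = df_a.copy()

# Default behaviour: a new DataFrame is returned, df_a itself is untouched
result = df_a.drop('naujas', axis=1)
print('naujas' in df_a.columns)   # the original still has the column

# Option 1: reassign the returned frame to the same name
df_a = df_a.drop('naujas', axis=1)

# Option 2: modify in place; the call itself returns None
returned = df_b.drop('naujas', axis=1, inplace=True)

print(list(df_a.columns), list(df_b.columns), returned)
```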

**Let's try deleting a row:**

In [49]:
df.drop('e')

          U         V         W         X         Y         Z
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455
b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230
d  0.492632  0.405244  0.093171  0.502899  0.473212  0.618707


When deleting a row, axis=0 does not have to be specified, because it is the *default* value.

**Selecting rows**

In [50]:
df.loc['e']

U    0.765506
V    0.752538
W    0.280722
X    0.570152
Y    0.128662
Z    0.903604
Name: e, dtype: float64

Rows can also be selected by integer position:

In [51]:
df.iloc[4]

U    0.765506
V    0.752538
W    0.280722
X    0.570152
Y    0.128662
Z    0.903604
Name: e, dtype: float64
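Both `loc` and `iloc` also accept slices, and they treat the end of the slice differently. A minimal sketch on a hypothetical frame (`df_demo`):

```python
import pandas as pd

# Hypothetical frame for illustration only
df_demo = pd.DataFrame({'U': [10, 20, 30, 40]},
                       index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints are included, so rows 'b' and 'c'
by_label = df_demo.loc['b':'c']

# Position-based slice: the end position is excluded, so positions 1 and 2
by_position = df_demo.iloc[1:3]

# Negative positions count from the end, as with Python lists
last_row = df_demo.iloc[-1]

print(by_label)
print(by_position)
print(last_row)
```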

**Subsets**

To get a single value from the table:

In [52]:
df.loc['c', 'Z']

0.11122976520584293

To get a fragment made of selected rows and columns (a *subset*):

In [53]:
df.loc[['a', 'c'], ['U', 'V', 'Z']]

          U         V         Z
a  0.863571  0.883254  0.922455
c  0.543049  0.683090  0.111230


**Selecting data by condition:**

Selecting data by condition works much like it does in NumPy:

In [54]:
df

          U         V         W         X         Y         Z
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455
b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230
d  0.492632  0.405244  0.093171  0.502899  0.473212  0.618707
e  0.765506  0.752538  0.280722  0.570152  0.128662  0.903604


In [55]:
df[df>0.4] 

          U         V         W         X         Y         Z
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455
b  0.932546       NaN       NaN  0.912207  0.715740       NaN
c  0.543049  0.683090  0.839093       NaN  0.622360       NaN
d  0.492632  0.405244       NaN  0.502899  0.473212  0.618707
e  0.765506  0.752538       NaN  0.570152       NaN  0.903604


Where a value satisfies the condition, we get the value; where it does not, we get NaN.

If we need the subset in which the values of column 'W' are > 0.5:

In [56]:
df[df['W']>0.5]

          U         V         W         X         Y         Z
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230


The difference between the two operations is this: when the condition is applied to the whole DataFrame, we get a DataFrame of the same shape with NaN wherever the original value fails the condition. When the condition is applied to a single column, we get only the rows that satisfy it, that is, we filter.
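The difference comes from the shape of the boolean mask that the comparison produces. The sketch below builds both masks explicitly on a small hypothetical frame (`df_demo`):

```python
import pandas as pd

# Hypothetical frame for illustration only
df_demo = pd.DataFrame({'W': [0.7, 0.2, 0.9], 'X': [0.1, 0.6, 0.3]},
                       index=['a', 'b', 'c'])

# Comparing one column gives a boolean Series: one True/False per row
row_mask = df_demo['W'] > 0.5

# Comparing the whole frame gives a boolean DataFrame: one True/False per cell
cell_mask = df_demo > 0.5

filtered = df_demo[row_mask]   # keeps only the rows marked True
masked = df_demo[cell_mask]    # keeps the shape, NaN where the cell is False

print(row_mask)
print(filtered)
print(masked)
```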

**Chaining queries**

In [57]:
df[df['W']>0.5][['U', 'W', 'Z']]

          U         W         Z
a  0.863571  0.787205  0.922455
c  0.543049  0.839093  0.111230


In this example we get the same result as running two separate statements one after the other: *df1 = df[df['W']>0.5], df1[['U', 'W', 'Z']]*. Chaining the queries lets us skip the intermediate variable (*df1* in this case).
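The same selection can also be written as a single `loc` call that takes the row condition and the column list together. A minimal sketch on a hypothetical frame (`df_demo`), shown as an alternative rather than a replacement for the chained form:

```python
import pandas as pd

# Hypothetical frame for illustration only
df_demo = pd.DataFrame({'U': [1, 2, 3],
                        'W': [0.7, 0.2, 0.9],
                        'Z': [7, 8, 9]},
                       index=['a', 'b', 'c'])

# Chained form: filter rows first, then pick columns from the result
chained = df_demo[df_demo['W'] > 0.5][['U', 'Z']]

# Single-step form: rows and columns in one loc call
single_step = df_demo.loc[df_demo['W'] > 0.5, ['U', 'Z']]

print(chained.equals(single_step))
print(single_step)
```

The single-step form is generally the safer choice when the selection is later used for assignment, since chained indexing may operate on a temporary copy.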

**Combining conditions**

In [58]:
df

          U         V         W         X         Y         Z
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455
b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230
d  0.492632  0.405244  0.093171  0.502899  0.473212  0.618707
e  0.765506  0.752538  0.280722  0.570152  0.128662  0.903604


In [59]:
df[(df['U']>0.5) & (df['Z']<0.5)]

          U         V         W         X         Y         Z
b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230


We got the rows in which the value in column U is greater than 0.5 and the value in column Z is less than 0.5.

In [60]:
df[(df['U']>0.5) & (df['Z']<0.5)][['U', 'Z']]

          U         Z
b  0.932546  0.373847
c  0.543049  0.111230


Here we combined two conditions and asked for only two columns of the result.
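Besides `&` (and), conditions can be combined with `|` (or) and negated with `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`. A minimal sketch on a hypothetical frame (`df_demo`):

```python
import pandas as pd

# Hypothetical frame for illustration only
df_demo = pd.DataFrame({'U': [0.9, 0.2, 0.6], 'Z': [0.1, 0.8, 0.7]},
                       index=['a', 'b', 'c'])

# Both conditions must hold
both = df_demo[(df_demo['U'] > 0.5) & (df_demo['Z'] < 0.5)]

# At least one condition must hold
either = df_demo[(df_demo['U'] > 0.5) | (df_demo['Z'] < 0.5)]

# Rows where the condition does NOT hold
negated = df_demo[~(df_demo['U'] > 0.5)]

print(list(both.index), list(either.index), list(negated.index))
```

Python's `and`, `or`, and `not` keywords do not work element-wise here and raise a ValueError when applied to a Series.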

**Operations on the index column**

reset_index turns the old index into an ordinary column and creates a new integer index. Use *inplace=True* if you want to change the original.

In [61]:
df.reset_index()

   index         U         V         W         X         Y         Z
0      a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455
1      b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847
2      c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230
3      d  0.492632  0.405244  0.093171  0.502899  0.473212  0.618707
4      e  0.765506  0.752538  0.280722  0.570152  0.128662  0.903604


To set a new index, first add it as a new column:

In [62]:
naujas_indeksas = 'Vilnius Kaunas Klaipėda Šiauliai Panevėžys'.split()

In [63]:
naujas_indeksas

['Vilnius', 'Kaunas', 'Klaipėda', 'Šiauliai', 'Panevėžys']

In [64]:
df['Miestai'] = naujas_indeksas

In [65]:
df

          U         V         W         X         Y         Z    Miestai
a  0.863571  0.883254  0.787205  0.672174  0.751997  0.922455    Vilnius
b  0.932546  0.317620  0.241403  0.912207  0.715740  0.373847     Kaunas
c  0.543049  0.683090  0.839093  0.282489  0.622360  0.111230   Klaipėda
d  0.492632  0.405244  0.093171  0.502899  0.473212  0.618707   Šiauliai
e  0.765506  0.752538  0.280722  0.570152  0.128662  0.903604  Panevėžys


In [66]:
df.set_index('Miestai')

                  U         V         W         X         Y         Z
Miestai
Vilnius    0.863571  0.883254  0.787205  0.672174  0.751997  0.922455
Kaunas     0.932546  0.317620  0.241403  0.912207  0.715740  0.373847
Klaipėda   0.543049  0.683090  0.839093  0.282489  0.622360  0.111230
Šiauliai   0.492632  0.405244  0.093171  0.502899  0.473212  0.618707
Panevėžys  0.765506  0.752538  0.280722  0.570152  0.128662  0.903604
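Like `drop`, `set_index` returns a new DataFrame and leaves the original unchanged unless the result is assigned or inplace=True is passed. The sketch below shows this on a hypothetical frame (`df_demo`), along with the fact that the previous index is discarded:

```python
import pandas as pd

# Hypothetical frame for illustration only
df_demo = pd.DataFrame({'U': [1, 2], 'Miestai': ['Vilnius', 'Kaunas']},
                       index=['a', 'b'])

# set_index returns a new frame; the old labels 'a' and 'b' are dropped
by_city = df_demo.set_index('Miestai')

print('Miestai' in df_demo.columns)   # the original still has the column
print(by_city.loc['Kaunas', 'U'])     # rows are now addressed by city name

# reset_index moves 'Miestai' back into an ordinary column
restored = by_city.reset_index()
print(list(restored.columns))
```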
