# Pandas

Pandas is a data analysis library built on top of NumPy. It is the main tool in the Python ecosystem for analyzing, cleaning, and preparing data. It is fast and productive to work with, and it can read data from a variety of sources.

Install pandas with *conda install pandas* or *pip install pandas*.

In [3]:
import numpy as np
import pandas as pd

# Series

A Series is the basic one-dimensional pandas data structure, built on top of the NumPy array.

In [5]:
labels = ['x', 'y', 'z']
data = [20, 30, 40]
pd.Series(data=data)

0    20
1    30
2    40
dtype: int64

As we can see, a pandas Series differs from a plain array in that it has an index. One of the parameters we can pass when creating a Series is index.

In [6]:
pd.Series(data=data, index=labels)

x    20
y    30
z    40
dtype: int64

A pandas Series can also be created from a Python dictionary:

In [8]:
zodynas = {'x':20, 'y':30, 'z':40}
pd.Series(zodynas)

x    20
y    30
z    40
dtype: int64
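To make the role of the index more concrete, here is a minimal sketch that inspects the labels and the underlying values of a Series. The variable names `s`, `index_labels`, `values`, and `s_from_dict` are only illustrative and do not appear in the notebook.

```python
import pandas as pd

labels = ['x', 'y', 'z']
data = [20, 30, 40]

s = pd.Series(data=data, index=labels)
index_labels = s.index.tolist()   # the labels attached to each value
values = s.to_numpy()             # the underlying NumPy array

# Dictionary keys become the index, so this builds an equal Series.
s_from_dict = pd.Series({'x': 20, 'y': 30, 'z': 40})
same = s.equals(s_from_dict)

print(index_labels, values, same)
```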

**Retrieving a value from a Series**

In [13]:
serija = pd.Series([1,2,3,4,5], ['Vilnius', 'Kaunas', 'Klaipėda', 'Panevėžys', 'Šiauliai'])
# note that the data and the index can be passed positionally, without naming the parameters.

In [14]:
serija


Vilnius      1
Kaunas       2
Klaipėda     3
Panevėžys    4
Šiauliai     5
dtype: int64

In [15]:
serija['Vilnius']

1
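Besides plain square brackets, a value can be retrieved explicitly by label or by position. The sketch below assumes the same `serija` as above; the names `by_label`, `by_position`, `several`, and `missing` are illustrative.

```python
import pandas as pd

serija = pd.Series([1, 2, 3, 4, 5],
                   ['Vilnius', 'Kaunas', 'Klaipėda', 'Panevėžys', 'Šiauliai'])

by_label = serija.loc['Kaunas']                  # lookup by index label
by_position = serija.iloc[1]                     # lookup by integer position
several = serija.loc[['Vilnius', 'Šiauliai']]    # a list of labels returns a Series

# .get returns a default instead of raising KeyError for an unknown label.
missing = serija.get('Lentvaris')

print(by_label, by_position, several.tolist(), missing)
```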

**Operations on Series**

In [17]:
serija2 = pd.Series([1,2,3,4,5], ['Vilnius', 'Kaunas', 'Lentvaris', 'Šiauliai', 'Klaipėda'])

In [18]:
serija2

Vilnius      1
Kaunas       2
Lentvaris    3
Šiauliai     4
Klaipėda     5
dtype: int64

when adding two Series, pandas aligns them by index label and sums the values wherever the labels match:

In [19]:
serija + serija2

Kaunas       4.0
Klaipėda     8.0
Lentvaris    NaN
Panevėžys    NaN
Vilnius      2.0
Šiauliai     9.0
dtype: float64

Where pandas could not perform the addition because a label exists in only one of the two Series, it produced NaN (*not a number*). NaN is a floating-point value that an int64 Series cannot hold, so the result is converted from integer to float64.
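If missing labels should count as zero rather than produce NaN, one option is `Series.add` with its `fill_value` parameter. This is a small sketch using the two Series from above; the names `plain` and `total` are illustrative.

```python
import pandas as pd

serija = pd.Series([1, 2, 3, 4, 5],
                   ['Vilnius', 'Kaunas', 'Klaipėda', 'Panevėžys', 'Šiauliai'])
serija2 = pd.Series([1, 2, 3, 4, 5],
                    ['Vilnius', 'Kaunas', 'Lentvaris', 'Šiauliai', 'Klaipėda'])

plain = serija + serija2                    # NaN for labels present in only one Series
total = serija.add(serija2, fill_value=0)   # a missing label is treated as 0

# With no NaN left, the result can be cast back to integers.
total_int = total.astype(int)

print(total_int)
```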

# DataFrames

DataFrames are the central object of pandas operations. To create a new DataFrame (DF), we pass the *data*, *index*, and *columns* parameters:

In [4]:
df = pd.DataFrame(np.random.rand(5,6), 
                  ['a', 'b', 'c', 'd', 'e'], 
                  ['U', 'V', 'W', 'X', 'Y', 'Z'])

In [28]:
df      

|   | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 |
| b | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 |
| c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 |
| e | 0.113378 | 0.060269 | 0.325921 | 0.646713 | 0.12442 | 0.666479 |


Each column is a pandas Series, and all columns share the same index (a, b, c, d, e), e.g.:

In [5]:
df['U']

a    0.361370
b    0.174726
c    0.605317
d    0.931402
e    0.113378
Name: U, dtype: float64

In [6]:
type(df['U'])

pandas.core.series.Series
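Because `np.random.rand` returns different numbers on every run, the values shown will differ from session to session. The sketch below uses a seeded NumPy generator to get a reproducible DataFrame and checks that a column is a Series sharing the DataFrame's index. The seed `0` and the names `rng` and `col` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded generator, so reruns give the same values
df = pd.DataFrame(rng.random((5, 6)),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['U', 'V', 'W', 'X', 'Y', 'Z'])

col = df['U']

print(type(col).__name__)           # the column is a Series
print(col.index.equals(df.index))   # it carries the same row labels as the DataFrame
```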

**If we want several columns:**

In [9]:
df[['U', 'Y', 'Z']]

|   | U | Y | Z |
|---|---|---|---|
| a | 0.36137 | 0.462288 | 0.504773 |
| b | 0.174726 | 0.128283 | 0.042082 |
| c | 0.605317 | 0.351447 | 0.072083 |
| d | 0.931402 | 0.232773 | 0.296398 |
| e | 0.113378 | 0.12442 | 0.666479 |


**Creating a new column**

In [13]:
df['naujas'] = [1, 2, 3, 4, 5]

In [14]:
df

|   | U | V | W | X | Y | Z | naujas |
|---|---|---|---|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 | 1 |
| b | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 | 2 |
| c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 | 3 |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 | 4 |
| e | 0.113378 | 0.060269 | 0.325921 | 0.646713 | 0.12442 | 0.666479 | 5 |


**Deleting a column**

In [19]:
df.drop('naujas', axis=1)

|   | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 |
| b | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 |
| c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 |
| e | 0.113378 | 0.060269 | 0.325921 | 0.646713 | 0.12442 | 0.666479 |


axis=0 would mean that the operation applies to a row, while axis=1 means a column.
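For readers who find the axis numbers hard to remember, `drop` also accepts the `index=` and `columns=` keywords, which state the intent directly. A minimal sketch on a small made-up DataFrame (the values and the result names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'U': [0.1, 0.2, 0.3], 'naujas': [1, 2, 3]},
                  index=['a', 'b', 'c'])

no_col = df.drop('naujas', axis=1)      # axis=1: drop a column
no_col_kw = df.drop(columns='naujas')   # same result, spelled with a keyword

no_row = df.drop('c', axis=0)           # axis=0: drop a row
no_row_kw = df.drop(index='c')          # same result, spelled with a keyword

print(no_col.columns.tolist(), no_row.index.tolist())
```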

**The inplace parameter**

our last operation did not change the original; if we display df now, we see that it is exactly as it was:

In [22]:
df

|   | U | V | W | X | Y | Z | naujas |
|---|---|---|---|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 | 1 |
| b | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 | 2 |
| c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 | 3 |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 | 4 |
| e | 0.113378 | 0.060269 | 0.325921 | 0.646713 | 0.12442 | 0.666479 | 5 |


to change the original, we have to pass the parameter inplace=True:

In [24]:
df.drop('naujas', axis=1, inplace=True)

In [25]:
df

|   | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 |
| b | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 |
| c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 |
| e | 0.113378 | 0.060269 | 0.325921 | 0.646713 | 0.12442 | 0.666479 |


Because *inplace* defaults to False, we are protected from accidentally corrupting the data.
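An alternative to `inplace=True` is to assign the returned DataFrame back to the same name. The sketch below, on a small made-up DataFrame, shows both routes and also that the in-place call itself returns `None`; the names `copy_without`, `result`, and `df2` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'U': [0.1, 0.2], 'naujas': [1, 2]}, index=['a', 'b'])

copy_without = df.drop('naujas', axis=1)   # returns a new DataFrame
still_there = 'naujas' in df.columns       # the original is untouched

# Route 1: reassign the result to the same name.
df2 = df.copy()
df2 = df2.drop('naujas', axis=1)

# Route 2: modify in place; note that the call then returns None.
result = df.drop('naujas', axis=1, inplace=True)

print(still_there, result, df.columns.tolist(), df2.columns.tolist())
```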

**Let's try deleting a row:**

In [26]:
df.drop('e')

|   | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 |
| b | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 |
| c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 |


when deleting a row, axis=0 does not need to be specified, since it is the *default* value

**Selecting rows**

In [28]:
df.loc['e']

U    0.113378
V    0.060269
W    0.325921
X    0.646713
Y    0.124420
Z    0.666479
Name: e, dtype: float64

rows can also be selected by integer position:

In [31]:
df.iloc[4]

U    0.113378
V    0.060269
W    0.325921
X    0.646713
Y    0.124420
Z    0.666479
Name: e, dtype: float64
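The difference between the two accessors is that `loc` looks rows up by label while `iloc` looks them up by position. A minimal sketch on a DataFrame filled with consecutive integers (made-up data, so the rows are easy to recognize):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(5, 6),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['U', 'V', 'W', 'X', 'Y', 'Z'])

row_by_label = df.loc['e']      # label-based lookup
row_by_position = df.iloc[4]    # position-based lookup (fifth row)
last_row = df.iloc[-1]          # negative positions count from the end

print(row_by_label.tolist())
```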

**Subsets**

if we want a single value from the table:

In [33]:
df.loc['c', 'Z']

0.07208345176824538

if we want a fragment made of several rows and columns (a *subset*):

In [36]:
df.loc[['a', 'c'], ['U', 'V', 'Z']]

|   | U | V | Z |
|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.504773 |
| c | 0.605317 | 0.887836 | 0.072083 |
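Subsets can also be taken with slices instead of explicit lists. One detail worth knowing is that label slices in `loc` include the end label, while position slices in `iloc` exclude the end position. A sketch on made-up integer data (all result names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(5, 6),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['U', 'V', 'W', 'X', 'Y', 'Z'])

by_label = df.loc['a':'c', 'U':'W']   # end labels 'c' and 'W' are included
by_position = df.iloc[0:3, 0:3]       # end position 3 is excluded
single = df.at['c', 'Z']              # fast access to one scalar by label

print(by_label.shape, single)
```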


**Selecting data by condition:**

selecting data by condition works much the same way as in NumPy:

In [37]:
df

|   | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 |
| b | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 |
| c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 |
| e | 0.113378 | 0.060269 | 0.325921 | 0.646713 | 0.12442 | 0.666479 |


In [39]:
df[df>0.4] 

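|   | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | NaN | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 |
| b | NaN | 0.518 | NaN | 0.839525 | NaN | NaN |
| c | 0.605317 | 0.887836 | NaN | NaN | NaN | NaN |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | NaN | NaN |
| e | NaN | NaN | NaN | 0.646713 | NaN | 0.666479 |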


where a value satisfies the condition we get the value itself; where it does not, we get NaN.

if we need the subset in which the values of column 'W' are > 0.5:

In [41]:
df[df['W']>0.5]

|   | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 |


The difference between these operations is that when we apply the condition to the whole DataFrame, we get a DataFrame of the same shape with NaN wherever the original value does not satisfy the condition. When we apply the condition to a single column, we get only the rows that satisfy it, i.e. we filter.
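The contrast between masking and filtering can be checked on a tiny made-up DataFrame. The sketch also shows that masking with a boolean DataFrame gives the same result as `DataFrame.where`; the names `masked` and `filtered` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'U': [0.36, 0.17, 0.61], 'W': [0.70, 0.12, 0.09]},
                  index=['a', 'b', 'c'])

masked = df[df > 0.4]          # boolean DataFrame: same shape, NaN where False
filtered = df[df['W'] > 0.5]   # boolean Series: keeps only the matching rows

print(masked.shape, filtered.shape)
print(masked.equals(df.where(df > 0.4)))
```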

**Chaining queries**

In [44]:
df[df['W']>0.5][['U', 'W', 'Z']]

|   | U | W | Z |
|---|---|---|---|
| a | 0.36137 | 0.702827 | 0.504773 |
| d | 0.931402 | 0.531652 | 0.296398 |


in this example we get the same result as running two separate lines one after the other: *df1 = df[df['W']>0.5]*, then *df1[['U', 'W', 'Z']]*. Chaining queries lets us avoid creating an extra named variable (*df1* in this case).
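The same row filter and column selection can also be written as a single `loc` call, which takes the row condition and the column list together. A sketch using a few of the values from the table above (the names `chained` and `single_call` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'U': [0.36137, 0.174726, 0.931402],
                   'W': [0.702827, 0.11926, 0.531652],
                   'X': [0.98688, 0.839525, 0.587163],
                   'Z': [0.504773, 0.042082, 0.296398]},
                  index=['a', 'b', 'd'])

chained = df[df['W'] > 0.5][['U', 'W', 'Z']]          # two selections in a row
single_call = df.loc[df['W'] > 0.5, ['U', 'W', 'Z']]  # rows and columns at once

print(single_call)
```

The single `loc` call is usually the safer form when the selection is later used for assignment, since chained indexing may operate on a temporary copy.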

**Combining conditions**

In [45]:
df

|   | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 |
| b | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 |
| c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 |
| e | 0.113378 | 0.060269 | 0.325921 | 0.646713 | 0.12442 | 0.666479 |


In [46]:
df[(df['U']>0.5) & (df['Z']<0.5)]

|   | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 |


we got the rows in which the value in column U is greater than 0.5 and the value in column Z is less than 0.5.

In [47]:
df[(df['U']>0.5) & (df['Z']<0.5)][['U', 'Z']]

|   | U | Z |
|---|---|---|
| c | 0.605317 | 0.072083 |
| d | 0.931402 | 0.296398 |


Here we combined two conditions and asked for only two columns of the result.
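Conditions are combined element-wise with `&` (and), `|` (or), and `~` (not), each wrapped in parentheses. Python's own `and` does not work here because it asks for a single truth value of a whole Series. The sketch below uses the U and Z columns from the table above; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'U': [0.36137, 0.174726, 0.605317, 0.931402, 0.113378],
                   'Z': [0.504773, 0.042082, 0.072083, 0.296398, 0.666479]},
                  index=['a', 'b', 'c', 'd', 'e'])

both = df[(df['U'] > 0.5) & (df['Z'] < 0.5)]     # both conditions hold
either = df[(df['U'] > 0.5) | (df['Z'] > 0.5)]   # at least one holds
not_u = df[~(df['U'] > 0.5)]                     # negation of a condition

# The keyword `and` raises ValueError: a Series has no single truth value.
try:
    df[(df['U'] > 0.5) and (df['Z'] < 0.5)]
    raised = False
except ValueError:
    raised = True

print(both.index.tolist(), either.index.tolist(), not_u.index.tolist(), raised)
```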

**Operations on the index column**

reset_index turns the old index into an additional column and creates a new integer index. Use *inplace=True* to change the original.

In [48]:
df.reset_index()

|   | index | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|---|
| 0 | a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 |
| 1 | b | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 |
| 2 | c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 |
| 3 | d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 |
| 4 | e | 0.113378 | 0.060269 | 0.325921 | 0.646713 | 0.12442 | 0.666479 |
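If the old index is not worth keeping as a column, `reset_index` accepts `drop=True`. A minimal sketch on a small made-up DataFrame (the names `kept` and `dropped` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'U': [1, 2, 3]}, index=['a', 'b', 'c'])

kept = df.reset_index()              # old index becomes a column named 'index'
dropped = df.reset_index(drop=True)  # old index is discarded

print(kept.columns.tolist(), dropped.columns.tolist())
print(df.index.tolist())             # the original keeps its labels
```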


One way to set a new index is to first add it as a new column:

In [49]:
naujas_indeksas = 'Vilnius Kaunas Klaipėda Šiauliai Panevėžys'.split()

In [50]:
naujas_indeksas

['Vilnius', 'Kaunas', 'Klaipėda', 'Šiauliai', 'Panevėžys']

In [52]:
df['Miestai'] = naujas_indeksas

In [54]:
df

|   | U | V | W | X | Y | Z | Miestai |
|---|---|---|---|---|---|---|---|
| a | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 | Vilnius |
| b | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 | Kaunas |
| c | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 | Klaipėda |
| d | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 | Šiauliai |
| e | 0.113378 | 0.060269 | 0.325921 | 0.646713 | 0.12442 | 0.666479 | Panevėžys |


In [57]:
df.set_index('Miestai')

| Miestai | U | V | W | X | Y | Z |
|---|---|---|---|---|---|---|
| Vilnius | 0.36137 | 0.902063 | 0.702827 | 0.98688 | 0.462288 | 0.504773 |
| Kaunas | 0.174726 | 0.518 | 0.11926 | 0.839525 | 0.128283 | 0.042082 |
| Klaipėda | 0.605317 | 0.887836 | 0.094214 | 0.327027 | 0.351447 | 0.072083 |
| Šiauliai | 0.931402 | 0.90541 | 0.531652 | 0.587163 | 0.232773 | 0.296398 |
| Panevėžys | 0.113378 | 0.060269 | 0.325921 | 0.646713 | 0.12442 | 0.666479 |
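Like `drop` and `reset_index`, `set_index` returns a new DataFrame and leaves the original unchanged, and the previous index labels are discarded in the result. A sketch on a two-row made-up DataFrame, reassigning the result to keep the change (the names `by_city`, `original_index`, and `value` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'U': [0.36137, 0.174726],
                   'Miestai': ['Vilnius', 'Kaunas']},
                  index=['a', 'b'])

by_city = df.set_index('Miestai')    # new DataFrame indexed by city
original_index = df.index.tolist()   # df itself still has 'a', 'b'

df = df.set_index('Miestai')         # reassign to keep the new index
value = df.loc['Kaunas', 'U']        # rows are now looked up by city name

print(original_index, df.index.tolist(), value)
```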
