# PANDAS

Pandas is a free software library written for the Python programming language **for data manipulation and analysis**. In particular, **it offers data structures and operations for manipulating numerical tables and time series**. Its central data structure is the DataFrame, which is widely used to prepare data for machine learning. **Pandas can import data from various file formats such as CSV and Excel, and it provides data manipulation operations such as groupby, join, merge, melt, and concatenation, as well as data cleaning features such as filling, replacing, or imputing null values**.

**It can perform five significant steps that are required for processing and analysis of data : load, manipulate, prepare, model, and analyze.**
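
As a rough illustration of the load, manipulate, and prepare steps, the sketch below builds a small table in memory, fills a missing value, and aggregates with `groupby`. The column names and values are made up for illustration; with a real file, the table would typically come from `pd.read_csv` instead of an inline dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical data; in practice this would usually be loaded from a file
sales = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "amount": [100.0, np.nan, 50.0, 70.0],
})

# Data cleaning: replace the missing amount with 0
sales["amount"] = sales["amount"].fillna(0)

# Manipulation: total amount per city
totals = sales.groupby("city")["amount"].sum()
print(totals)
```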

# Series



In [1]:
import pandas as pd
import numpy as np

**A pandas Series is a one-dimensional labeled array capable of holding data of any type** (integer, string, float, Python objects, etc.). To fully understand DataFrames, you need to know the basics of Series. You can think of a pandas Series as **a column with labels** in an Excel sheet.

A Series is defined as a **one-dimensional array that is capable of storing various data types**. **The row labels of a Series are called the index**. Using the `pd.Series()` constructor, we can easily convert a list, tuple, or dictionary into a Series. A Series cannot contain multiple columns.

In [3]:
serie = pd.Series([1,2,3,"five", True, [1,2,3], {"a": 1},4,5])
serie

0            1
1            2
2            3
3         five
4         True
5    [1, 2, 3]
6     {'a': 1}
7            4
8            5
dtype: object

In a series, **the axis labels are called indexes.** **Series can only contain a single list with an index, whereas the DataFrames can be made of more than one series.** 
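
To make the relationship between Series and DataFrames concrete, here is a minimal sketch that combines two Series sharing the same index into one DataFrame. The names `ages` and `cities` and their values are invented for this example.

```python
import pandas as pd

ages = pd.Series([10, 22, 13], index=["a", "b", "c"], name="age")
cities = pd.Series(["Delhi", "Mumbai", "Delhi"], index=["a", "b", "c"], name="city")

# Each Series becomes one column; the shared index becomes the row labels
df = pd.concat([ages, cities], axis=1)
print(df)

# Selecting a single column gives back a Series
print(type(df["age"]))
```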

**You can create a Series by calling `pandas.Series()`. A list, NumPy array, dict, or scalar value can be turned into a pandas Series.**

**``pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)``**
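
As an illustration of the main constructor parameters, the sketch below passes `data`, `index`, `dtype`, and `name` explicitly. The values and labels are arbitrary.

```python
import pandas as pd

s = pd.Series(
    data=[1, 2, 3],          # the values
    index=["x", "y", "z"],   # one label per value
    dtype="float64",         # force the values to be stored as floats
    name="demo",             # shown in the repr and used as a column name in a DataFrame
)
print(s)
```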

Additional sources:

[SOURCE01](https://pandas.pydata.org/docs/reference/api/pandas.Series.html), [SOURCE02](https://www.datasciencemadesimple.com/create-series-in-python-pandas/), 
[SOURCE03](https://www.cbsecsip.in/2020/08/pandas-data-structure-series.html), 
[SOURCE04](https://towardsdatascience.com/pandas-series-a-lightweight-intro-b7963a0d62a2), 
[SOURCE05](https://www.educba.com/pandas-series/), 
[SOURCE06](http://www.rebellionrider.com/python-pandas-series-in-detail-part-1/), 
[SOURCE07](https://www.javatpoint.com/python-pandas-series), &
[SOURCE08](https://towardsdatascience.com/20-examples-to-master-pandas-series-bc4c68200324)<br>

In [6]:
my_list = [10, 20, 30]
labels = ['a', 'b', 'c']
arr = np.array([10, 20, 30])
d = {'dictkey1': 10, 'dictkey2': 20, 'dictkey3': 30}

# 1. create a Series from a list

print(pd.Series(my_list))  
print(pd.Series([1,2,3]))
print()
# 2. create a Series from an array
print(pd.Series(data = arr, index = labels))  # if index is omitted, a default 0-based index is used
print("----------------------------")
print(pd.Series(data = arr, index = [1,2,3]))  # the index length must match the number of elements, otherwise a ValueError is raised
print()
print(pd.Series(data=arr, index = np.arange(3)))
print()
# 3. create a Series from a dict
print(pd.Series(d))  # keys become the index, values become the values

0    10
1    20
2    30
dtype: int64
0    1
1    2
2    3
dtype: int64

a    10
b    20
c    30
dtype: int64
----------------------------
1    10
2    20
3    30
dtype: int64

0    10
1    20
2    30
dtype: int64

dictkey1    10
dictkey2    20
dictkey3    30
dtype: int64


In [250]:
# if we want to supply our own index for a dict

pd.Series(data=d, index = ["q", "dictkey2", "y", "p"])  # only labels that are keys of the dict get a value;
# the other labels are not in the dict, so they are filled with NaN (not a number)

q            NaN
dictkey2    20.0
y            NaN
p            NaN
dtype: float64
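
The missing labels can be detected or handled after the fact. The sketch below reuses the same dictionary and shows three common options; note that the dtype became `float64` above because NaN is a floating-point value.

```python
import pandas as pd

d = {"dictkey1": 10, "dictkey2": 20, "dictkey3": 30}
s = pd.Series(data=d, index=["q", "dictkey2", "y", "p"])

print(s.isna())                  # True where the label was not a key of d
print(s.dropna())                # keep only the labels that were found
print(s.fillna(0).astype(int))   # or substitute a default and restore an integer dtype
```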

In [251]:
# So when building a Series from a dict, the index may be longer than the data, as shown above.

**Scalars** are single values representing one unit of data, such as an integer or bool, as opposed to data structures like a list or tuple, which are composed of scalars.

In [259]:
# 4. create a Series from a scalar value
pd.Series(data = "GSaray", index = range(3)) 

0    GSaray
1    GSaray
2    GSaray
dtype: object

In [388]:
pd.Series(data = 10, index = ["a", "b", "c"], name = "Serie of 10") 

a    10
b    10
c    10
Name: Serie of 10, dtype: int64

In [305]:
# a Series can also hold callables such as set, list, print, and dict

ser = pd.Series([set, list, print, dict])
ser[0]([1,2,3,4,5])

# In large projects we can build a Series full of functions and later call its elements
# one by one, or all of them in a loop. This relies on a Series preserving the type of
# each object it stores.

{1, 2, 3, 4, 5}

In [307]:
ser[2]("pandas serileri")  # element [2] is the print function

pandas serileri


In [309]:
func = pd.Series([sum, print, len])
print(func[0]([1,2,3,4,5]))  # sum func
print(func[2]([1,2,3,4,5]))  # len func

15
5
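
A Series of callables can also be iterated like any other Series. The following sketch, with labels chosen only for illustration, calls every stored function on the same input and collects the results by label.

```python
import pandas as pd

funcs = pd.Series([sum, len, max], index=["sum", "len", "max"])
numbers = [1, 2, 3, 4, 5]

# items() yields (label, function) pairs, so each function can be called in turn
results = {name: f(numbers) for name, f in funcs.items()}
print(results)
```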


**SOME COMMON ATTRIBUTES AND METHODS** [Official pandas API documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)<br>

**Series.dtype**	It returns the data type of the data.<br>
**Series.shape**	It returns a tuple of shape of the data.<br>
**Series.size**	    It returns the size of the data.<br>
**Series.ndim**	    It returns the number of dimensions in the data.<br>
**Series.index**    The index (axis labels) of the Series.<br>
**Series.keys**     Method; returns the index (alias for `Series.index`).<br>
**Series.values**   Returns Series as ndarray or ndarray-like depending on the dtype.<br>
**Series.items**    Method; lazily iterates over (index, value) tuples.<br>
**Series.head**   	Return the first n rows.<br>
**Series.tail** 	Return the last n rows.<br>
**Series.sample**   Return a random sample of items from an axis of object.<br>
**Series.sort_index**  Sort Series by index labels.<br>
**Series.sort_values**  Sort by the values.<br>
**Series.isin**     Whether elements in Series are contained in values.<br>
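
The inspection methods in the list (`head`, `tail`, `sample`) can be tried on a small Series. The values below are arbitrary, and `random_state=0` is only there so that the random draw is repeatable.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(10) * 10)

print(ser.head(3))                      # first three rows
print(ser.tail(2))                      # last two rows
print(ser.sample(n=3, random_state=0))  # three randomly chosen rows
```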

In [330]:
ser = pd.Series(np.random.randint(0,100,7))

print(ser.dtype)
print(ser.shape)
print(ser.size)
print(ser.ndim)
print()
print(ser.index)
print()
print(list(ser.index))
print("-------------------")
print(ser.keys)  # without parentheses this prints the bound method; call ser.keys() to get the index

print("Values", ser.values)
print(ser.items)  # also a bound method here; ser.items() yields (index, value) pairs
print("-------------------")
print("-------------------")
for index,value in ser.items():
    print(f"index : {index}, Value : {value}")

int64
(7,)
7
1

RangeIndex(start=0, stop=7, step=1)

[0, 1, 2, 3, 4, 5, 6]
-------------------
<bound method Series.keys of 0    54
1    71
2    72
3    80
4    93
5    64
6    63
dtype: int64>
Values [54 71 72 80 93 64 63]
<bound method Series.items of 0    54
1    71
2    72
3    80
4    93
5    64
6    63
dtype: int64>
-------------------
-------------------
index : 0, Value : 54
index : 1, Value : 71
index : 2, Value : 72
index : 3, Value : 80
index : 4, Value : 93
index : 5, Value : 64
index : 6, Value : 63


In [324]:
ser_label = pd.Series(data = [121, 200, 150, 99], index = ["terry", "micheal", "orion", "jason"])
print(ser_label)
print(ser_label.index)  # with a labeled index the labels are shown explicitly; no need for list(), * unpacking, or a for loop

terry      121
micheal    200
orion      150
jason       99
dtype: int64
Index(['terry', 'micheal', 'orion', 'jason'], dtype='object')


In [394]:
ser = pd.Series(list("Galatasaray"))
ser

0     G
1     a
2     l
3     a
4     t
5     a
6     s
7     a
8     r
9     a
10    y
dtype: object

In [390]:
ser.items

<bound method Series.items of 0     G
1     a
2     l
3     a
4     t
5     a
6     s
7     a
8     r
9     a
10    y
dtype: object>

In [391]:
ser.keys

<bound method Series.keys of 0     G
1     a
2     l
3     a
4     t
5     a
6     s
7     a
8     r
9     a
10    y
dtype: object>

In [278]:
ser.index

RangeIndex(start=0, stop=11, step=1)

In [279]:
s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4])
s.index

Int64Index([3, 2, 1, 4], dtype='int64')

In [331]:
ser9 = pd.Series(data = np.random.randint(0,25,10), index = [i for i in "cbaefghjik"])
ser9

c     8
b     7
a    16
e     6
f    13
g     7
h     7
j    20
i     3
k    11
dtype: int64

# .sort_index : 

Sort a Series by its index labels: `Series.sort_index()` applies the sort to the index labels rather than to the values, and the values are rearranged to follow their labels. By default it returns a new, sorted Series. The `kind` parameter selects the sorting algorithm: 'quicksort' (the default), 'mergesort', 'heapsort', or 'stable'.

In [9]:
s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4])
s.sort_index()  # this does not modify s; evaluating s again shows the original order
# passing the keyword argument inplace=True sorts s in place instead

1    c
2    b
3    a
4    d
dtype: object
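
The difference between the default behaviour and `inplace=True` can be checked directly. This is a small sketch using the same Series as in the cell above.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], index=[3, 2, 1, 4])

sorted_copy = s.sort_index()        # returns a new Series
order_after_copy = s.index.tolist() # s itself still has the original order
print(order_after_copy)

s.sort_index(inplace=True)          # sorts s itself and returns None
print(s.index.tolist())
```

Reassigning the result (`s = s.sort_index()`) has the same effect and is often easier to read in chained code.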

In [12]:
# sort descending
ser.sort_index(ascending = False)

10    y
9     a
8     r
7     a
6     s
5     a
4     t
3     a
2     l
1     a
0     G
dtype: object

# .sort_values : 

Sort a Series in ascending or descending order by the values

In [181]:
s = pd.Series([0, 1, 3, 10, 5])
print(s)
# Sort values ascending order (default behaviour)
print(s.sort_values())
#Sort values descending order
s.sort_values(ascending=False)

0     0
1     1
2     3
3    10
4     5
dtype: int64
0     0
1     1
2     3
4     5
3    10
dtype: int64


3    10
4     5
2     3
1     1
0     0
dtype: int64

# .isin : 

Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly

In [399]:
ser.isin(["r"])  # the values must be list-like; passing a bare string raises a TypeError

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8      True
9     False
10    False
dtype: bool

In [182]:
s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama','hippo'], name='animal')
s.isin(['cow', 'lama'])

0     True
1     True
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

In [24]:
# To invert the boolean values, use the ``~`` operator:
~s.isin(['cow', 'lama'])

0    False
1    False
2    False
3     True
4    False
5     True
Name: animal, dtype: bool

# .keys : 

Returns the index labels of the given Series object. It is a method, so it must be called with parentheses.

In [261]:
ser.keys()

RangeIndex(start=0, stop=11, step=1)

In [262]:
data5.keys

<bound method NDFrame.keys of    age section     city gender favourite_color
0   10       A  Gurgaon      M             red
1   22       B    Delhi      F            blue
2   13       C   Mumbai      F          yellow
3   21       B    Delhi      M            pink
4   12       B   Mumbai      M           black
5   11       A    Delhi      M           green
6   17       A   Mumbai      F             red>

# .values : 

Return Series as ndarray or ndarray-like depending on the dtype

In [27]:
ser.values  # without ()

array(['G', 'a', 'l', 'a', 't', 'a', 's', 'a', 'r', 'a', 'y'],
      dtype=object)

In [28]:
pd.Series([1, 2, 3]).values

array([1, 2, 3])

In [263]:
pd.Series(list('aabc')).values

array(['a', 'a', 'b', 'c'], dtype=object)

In [264]:
data5.values 

array([[10, 'A', 'Gurgaon', 'M', 'red'],
       [22, 'B', 'Delhi', 'F', 'blue'],
       [13, 'C', 'Mumbai', 'F', 'yellow'],
       [21, 'B', 'Delhi', 'M', 'pink'],
       [12, 'B', 'Mumbai', 'M', 'black'],
       [11, 'A', 'Delhi', 'M', 'green'],
       [17, 'A', 'Mumbai', 'F', 'red']], dtype=object)
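
The pandas documentation recommends `Series.to_numpy()` (or `Series.array`) over `.values` when a NumPy array is needed. A minimal sketch with arbitrary values:

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

arr = ser.to_numpy()                      # NumPy array holding the values
print(arr, type(arr))

as_float = ser.to_numpy(dtype="float64")  # optionally convert the dtype on the way out
print(as_float)
```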

# .items : 
    
This method lazily returns an iterable of (index, value) tuples, which is convenient for looping over a Series without building an intermediate list.

In [30]:
ser.items()

<zip at 0x7fbd733958c0>

In [34]:
for index,value in ser.items():
    print(value,index)

G 0
a 1
l 2
a 3
t 4
a 5
s 6
a 7
r 8
a 9
y 10


In [31]:
s = pd.Series(['A', 'B', 'C'])
for index, value in s.items():
    print(f"Index : {index}, Value : {value}")

Index : 0, Value : A
Index : 1, Value : B
Index : 2, Value : C


# 1. Indexing and slicing with Series

In [412]:

ser1 = pd.Series([1, 2, 3, 4], index = ['USA', 'Germany','RF', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4, 6], index = ['USA', 'Germany','Italy', 'Japan', 'Spain'])
ser1

USA        1
Germany    2
RF         3
Japan      4
dtype: int64

In [347]:
print(ser1.iloc[3])  # positional access; plain ser1[3] on a labeled index is deprecated in recent pandas
print(ser1["Japan"])
print(ser1.index)
print(ser1.index[3])
# if we don't know Japan's position but want to find it:

print(ser1.index.get_loc("Japan"))  # get location of Japan

4
4
Index(['USA', 'Germany', 'RF', 'Japan'], dtype='object')
Japan
3
4
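
Label-based and position-based access can be made explicit with `.loc` and `.iloc`. This matters because integer keys inside plain square brackets are ambiguous on a labeled index, and recent pandas releases emit a deprecation warning when such keys are treated as positions. A short sketch on the same data:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "RF", "Japan"])

by_label = ser1.loc["Japan"]      # label-based lookup
by_position = ser1.iloc[3]        # position-based lookup
position = ser1.index.get_loc("Japan")  # position of a label
print(by_label, by_position, position)
```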


In [349]:
print(ser1.shape, ser2.shape)
ser1+ser2  # NumPy arrays of different lengths could not be added at all. Here values with matching
# index labels are added, and labels missing from either Series get NaN.

(4,) (5,)


Germany    4.0
Italy      NaN
Japan      8.0
RF         NaN
Spain      NaN
USA        2.0
dtype: float64

In [350]:
ser1 * ser2

Germany     4.0
Italy       NaN
Japan      16.0
RF          NaN
Spain       NaN
USA         1.0
dtype: float64
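
If NaN is not the desired result for labels that exist in only one Series, the method form of the operator accepts a `fill_value`. The sketch below treats a missing value as 0 before adding.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=["USA", "Germany", "RF", "Japan"])
ser2 = pd.Series([1, 2, 5, 4, 6], index=["USA", "Germany", "Italy", "Japan", "Spain"])

# Labels present in only one Series are paired with fill_value instead of NaN
total = ser1.add(ser2, fill_value=0)
print(total)
```

The same parameter is available on `sub`, `mul`, and `div`; for multiplication a `fill_value` of 1 is usually the neutral choice.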

In [418]:
ser_label = pd.Series(data = [121, 200, 150, 99], index = ["terry", "micheal", "orion", "jason"])
print(ser_label.iloc[[0, 2]])  # to select several elements, pass a list inside the brackets
ser_label.iloc[0:2]  # slicing uses a single pair of brackets

terry    121
orion    150
dtype: int64


terry      121
micheal    200
dtype: int64

In [353]:
ser_label[["terry", "orion"]]

terry    121
orion    150
dtype: int64

In [355]:
ser_label[0:3]  # position 3 is not included

terry      121
micheal    200
orion      150
dtype: int64

In [354]:
# slicing by label:
# an important difference: when slicing with labels, the end label (orion) is included.
ser_label["terry" : "orion"]

terry      121
micheal    200
orion      150
dtype: int64
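
The two slicing conventions can be compared side by side with the explicit accessors. A minimal sketch on the same Series:

```python
import pandas as pd

ser_label = pd.Series([121, 200, 150, 99], index=["terry", "micheal", "orion", "jason"])

by_position = ser_label.iloc[0:2]          # end position excluded
by_label = ser_label.loc["terry":"orion"]  # end label included
print(len(by_position), len(by_label))
```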

**Selection with condition and broadcasting in series**

In [419]:
ser_label = pd.Series(data = [121, 200, 150, 99], index = ["terry", "micheal", "orion", "jason"])

"terry" in ser_label  # this searches the index

True

In [358]:
# how do we search among the values?
print(121 in ser_label)  # this searched the index and returned False
121 in ser_label.values  # this searched the values

False


True
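
Besides `in ser_label.values`, value membership can be expressed with vectorized comparisons. The sketch below shows two equivalent options; the value 500 is just an arbitrary value that is not in the Series.

```python
import pandas as pd

ser_label = pd.Series([121, 200, 150, 99], index=["terry", "micheal", "orion", "jason"])

has_label = "terry" in ser_label.index          # explicit index membership
has_value = (ser_label == 121).any()            # is any value equal to 121?
has_any = ser_label.isin([121, 500]).any()      # is any value in the given list?
print(has_label, has_value, has_any)
```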

In [361]:
print(ser_label < 100)  # returns a boolean Series
print()
print(ser_label[ser_label < 100])  # returns the matching values; this is boolean (logical) indexing

terry      False
micheal    False
orion      False
jason       True
dtype: bool

jason    99
dtype: int64


In [362]:
ser_label[ser_label<100] = 100
# this sets every value below 100 to 100; jason becomes 100
ser_label

terry      121
micheal    200
orion      150
jason      100
dtype: int64
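
Boolean assignment modifies the Series in place. When a new Series is preferred and the original should stay untouched, `clip` or `mask` give the same result, as in this sketch:

```python
import pandas as pd

ser_label = pd.Series([121, 200, 150, 99], index=["terry", "micheal", "orion", "jason"])

clipped = ser_label.clip(lower=100)            # raise everything below 100 up to 100
masked = ser_label.mask(ser_label < 100, 100)  # replace values where the condition is True
print(clipped)
```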

In [367]:
print(ser_label.isin([121]))
print("-------------------------------------")
print(ser_label[ser_label.isin([121])])
print("-------------------------------------")
ser_label[ser_label.isin([121])] = 125
print(ser_label)

terry       True
micheal    False
orion      False
jason      False
dtype: bool
-------------------------------------
terry    121
dtype: int64
-------------------------------------
terry      125
micheal    200
orion      150
jason      100
dtype: int64
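
For a plain value-for-value substitution like this one, `Series.replace` is a shorter alternative that returns a new Series. A minimal sketch, starting from the values the Series had before the assignment:

```python
import pandas as pd

ser_label = pd.Series([121, 200, 150, 100], index=["terry", "micheal", "orion", "jason"])

updated = ser_label.replace(121, 125)  # every 121 becomes 125; the original is unchanged
print(updated)
```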
