# Working with Data Series and Frames

## Series: 1-D data vector (similar to np.array)
NumPy array: https://numpy.org/doc/stable/reference/generated/numpy.array.html

pandas Series: https://pandas.pydata.org/docs/reference/series.html

In [1]:
import pandas as pd
# The last value is wrong. We will fix it shortly.
inflation = pd.Series((2.2, 3.4, 2.8, 1.6, 2.3, 2.7, 3.4, 3.2, 2.8, 3.8, -0.4, 1.6, 3.2, 2.1, 1.5, 1.5))
inflation

0     2.2
1     3.4
2     2.8
3     1.6
4     2.3
5     2.7
6     3.4
7     3.2
8     2.8
9     3.8
10   -0.4
11    1.6
12    3.2
13    2.1
14    1.5
15    1.5
dtype: float64

In [2]:
len(inflation)

16

In [3]:
inflation.values

array([ 2.2,  3.4,  2.8,  1.6,  2.3,  2.7,  3.4,  3.2,  2.8,  3.8, -0.4,
        1.6,  3.2,  2.1,  1.5,  1.5])
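
The comparison with `np.array` can be made concrete with a small sketch. It uses a shortened, three-value Series (`rates` is an illustrative name, not part of the notebook) and shows that `.values` exposes a NumPy array and that arithmetic is element-wise.

```python
import numpy as np
import pandas as pd

# Shortened illustrative data; `rates` is not a name from the notebook
rates = pd.Series((2.2, 3.4, 2.8))

# .values exposes the underlying NumPy array
raw = rates.values
print(type(raw))

# Arithmetic is applied element-wise, as with a NumPy array
doubled = rates * 2
print(doubled)

# NumPy functions accept a Series directly and return a Series
roots = np.sqrt(rates)
print(roots)
```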

### pandas data structures have an "index" --> good compatibility with dictionaries
https://pandas.pydata.org/docs/reference/api/pandas.Series.index.html#pandas.Series.index
--> Returns: Index

In [4]:
inflation.index

RangeIndex(start=0, stop=16, step=1)

In [13]:
type(inflation.index)

pandas.core.indexes.range.RangeIndex

https://pandas.pydata.org/docs/reference/api/pandas.Index.values.html

In [5]:
inflation.index.values

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [6]:
import numpy as np

inflation = pd.Series({1999: 2.2, 
                       2000: 3.4, 
                       2001: 2.8, 
                       2002: 1.6, 
                       2003: 2.3, 
                       2004: 2.7, 
                       2005: 3.4, 
                       2006: 3.2, 
                       2007: 2.8, 
                       2008: 3.8, 
                       2009: -0.4, 
                       2010: 1.6, 
                       2011: 3.2, 
                       2012: 2.1, 
                       2013: 1.5, 
                       2014: 1.6, 
                       2015: np.nan})
inflation

1999    2.2
2000    3.4
2001    2.8
2002    1.6
2003    2.3
2004    2.7
2005    3.4
2006    3.2
2007    2.8
2008    3.8
2009   -0.4
2010    1.6
2011    3.2
2012    2.1
2013    1.5
2014    1.6
2015    NaN
dtype: float64
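
Because the dictionary keys become the index, a Series supports several dictionary-style operations. The sketch below uses a four-entry subset of the data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Subset of the inflation data, keyed by year
rates = pd.Series({2012: 2.1, 2013: 1.5, 2014: 1.6, 2015: np.nan})

# Membership tests check the index (the "keys"), not the values
has_2013 = 2013 in rates
has_1990 = 1990 in rates

# Label lookup, and .get() with a default for a missing label
rate_2014 = rates[2014]
rate_1990 = rates.get(1990, "no data")

# Convert back to a plain dictionary
as_dict = rates.to_dict()
print(has_2013, has_1990, rate_2014, rate_1990)
```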

In [17]:
inflation = pd.Series((2.2, 3.4, 2.8, 1.6, 2.3, 2.7, 3.4, 3.2, 2.8, 3.8, -0.4, 1.6, 3.2, 2.1, 1.6, 1.5))
inflation.index = pd.Index(range(1999, 2015))

In [18]:
inflation.index

RangeIndex(start=1999, stop=2015, step=1)

In [20]:
inflation

1999    2.2
2000    3.4
2001    2.8
2002    1.6
2003    2.3
2004    2.7
2005    3.4
2006    3.2
2007    2.8
2008    3.8
2009   -0.4
2010    1.6
2011    3.2
2012    2.1
2013    1.6
2014    1.5
dtype: float64

In [8]:
inflation[2015] = np.nan
inflation

1999    2.2
2000    3.4
2001    2.8
2002    1.6
2003    2.3
2004    2.7
2005    3.4
2006    3.2
2007    2.8
2008    3.8
2009   -0.4
2010    1.6
2011    3.2
2012    2.1
2013    1.6
2014    1.5
2015    NaN
dtype: float64
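
Once a Series contains `NaN`, it helps to know how pandas treats the missing entry. This is a minimal sketch on a shortened Series (names are illustrative), assuming the default `skipna=True` behavior of reductions.

```python
import numpy as np
import pandas as pd

rates = pd.Series([2.1, 1.6, 1.5], index=[2012, 2013, 2014])
rates[2015] = np.nan  # assigning to a new label appends an entry

# isna() flags the missing entries; count() ignores them, len() does not
n_missing = int(rates.isna().sum())
n_total, n_valid = len(rates), int(rates.count())

# Reductions such as mean() skip NaN by default (skipna=True)
mean_rate = rates.mean()

# dropna() returns a new Series without the missing entries
clean = rates.dropna()
print(n_missing, n_total, n_valid, round(mean_rate, 3))
```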

In [21]:
inflation.index.name = "Year" 
inflation.name = "Inflation Rate"
inflation

Year
1999    2.2
2000    3.4
2001    2.8
2002    1.6
2003    2.3
2004    2.7
2005    3.4
2006    3.2
2007    2.8
2008    3.8
2009   -0.4
2010    1.6
2011    3.2
2012    2.1
2013    1.6
2014    1.5
2015    NaN
Name: Inflation Rate, dtype: float64

In [10]:
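# A Series has no columns: this only attaches a plain Python attribute
# and leaves the data and the printed output unchanged.
inflation.columns = ["%"]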
inflation

Year
1999    2.2
2000    3.4
2001    2.8
2002    1.6
2003    2.3
2004    2.7
2005    3.4
2006    3.2
2007    2.8
2008    3.8
2009   -0.4
2010    1.6
2011    3.2
2012    2.1
2013    1.6
2014    1.5
2015    NaN
Name: Inflation Rate, dtype: float64
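
If the intent is to label the values with "%", one possible approach (an assumption about the intent, not something the notebook states) is to rename the Series or convert it to a one-column DataFrame, which does have real columns.

```python
import pandas as pd

# Two-entry illustrative Series with the same name and index name as above
rates = pd.Series([1.6, 1.5], index=pd.Index([2013, 2014], name="Year"),
                  name="Inflation Rate")

# A Series is labeled by its name, not by a list of columns;
# rename() with a scalar returns a copy with a new name
renamed = rates.rename("%")

# to_frame() turns the Series into a one-column DataFrame,
# which does have a real .columns attribute
frame = rates.to_frame(name="%")
print(renamed.name, list(frame.columns))
```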

In [11]:
inflation.head()

Year
1999    2.2
2000    3.4
2001    2.8
2002    1.6
2003    2.3
Name: Inflation Rate, dtype: float64

In [12]:
inflation.tail()

Year
2011    3.2
2012    2.1
2013    1.6
2014    1.5
2015    NaN
Name: Inflation Rate, dtype: float64

## DataFrame: 2-D labeled data table
https://pandas.pydata.org/docs/reference/frame.html

![](https://media.geeksforgeeks.org/wp-content/cdn-uploads/creating_dataframe1.png)

### Making a DataFrame from a dictionary of lists

In [None]:
df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", 
                            "Allen, Mr. William Henry", 
                            "Bonnell, Miss. Elizabeth",],
                   "Age": [22, 
                           35, 
                           58],
                   "Sex": ["male", 
                           "male", 
                           "female"],})
print(df)
df

### Each column of a DataFrame is a Series

In [None]:
print(df["Age"])
print(type(df["Age"]))

* cf. A Series has no column labels, since it is just a single column of a DataFrame. It does have row labels (the index).

In [None]:
df["Age"].max()

* The describe() method provides a quick overview of the numerical data in a DataFrame.

In [None]:
df.describe()
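
By default `describe()` summarizes only the numerical columns. The sketch below, on a small frame with placeholder names, contrasts that default with `include="all"`, which also reports on text columns.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["A", "B", "C"],   # placeholder names
                       "Age": [22, 35, 58],
                       "Sex": ["male", "male", "female"]})

numeric_summary = people.describe()             # only the Age column
full_summary = people.describe(include="all")   # adds unique/top/freq rows
print(numeric_summary)
print(full_summary)
```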

### Reading tabular data

In [None]:
path = "https://raw.githubusercontent.com/crazytb/schadvmachinelearning/main/"
filename = "niaaa-report2009.csv"

alco2009 = pd.read_csv(path+filename, index_col="State")
alco2009

In [None]:
alco2009["Wine"].head()

In [None]:
alco2009.Beer.tail()

In [None]:
alco2009["Total"] = 0
alco2009.head()
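
Assigning a scalar fills the whole column, and assigning an expression of other columns computes it row by row. The sketch below uses made-up numbers in a hypothetical `drinks` frame, not values from the NIAAA file.

```python
import pandas as pd

# Made-up numbers for illustration only, not taken from the NIAAA file
drinks = pd.DataFrame({"Beer": [1.0, 1.5], "Wine": [0.5, 0.25]},
                      index=pd.Index(["State A", "State B"], name="State"))

drinks["Total"] = 0                                  # scalar is broadcast to every row
drinks["Total"] = drinks["Beer"] + drinks["Wine"]    # element-wise sum per row
print(drinks)
```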

In [None]:
alco2009.columns.values

In [None]:
alco2009.index.values

In [None]:
titanic_file = "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
titanic = pd.read_csv(titanic_file)
titanic

In [None]:
titanic.dtypes

In [None]:
titanic.info()

* When asking for the dtypes, no parentheses are used! dtypes is an attribute of both DataFrame and Series.
* Attributes of a DataFrame or Series are accessed without parentheses.
* Attributes represent a characteristic of a DataFrame/Series, whereas methods (which require parentheses) do something with the DataFrame/Series.
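
The distinction between attributes and methods can be sketched on a small illustrative frame (`people` is not a name from the notebook):

```python
import pandas as pd

people = pd.DataFrame({"Age": [22, 35, 58], "Sex": ["male", "male", "female"]})

# Attributes: read without parentheses
shape = people.shape
column_names = list(people.columns)
print(people.dtypes)

# Methods: called with parentheses, optionally with arguments
first_two = people.head(2)
oldest = people["Age"].max()

# Without parentheses a method is just a function object, not a result
print(callable(people.head), callable(people.shape))
```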

In [None]:
# Writing .xlsx files requires an Excel engine such as openpyxl to be installed
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)

### Selecting specific columns

In [None]:
ages = titanic["Age"]
ages.head()

In [None]:
ages = titanic.Age
ages.head()

* To select multiple columns, use a list of column names within the selection brackets [].

In [None]:
age_sex = titanic[["Age", "Sex"]]
age_sex.head()
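
The type of the result depends on what goes inside the selection brackets. A minimal sketch on an illustrative frame:

```python
import pandas as pd

people = pd.DataFrame({"Age": [22, 35, 58], "Sex": ["male", "male", "female"]})

one_col_series = people["Age"]     # single name -> Series (1-D)
one_col_frame = people[["Age"]]    # list with one name -> DataFrame (2-D)
print(type(one_col_series).__name__, one_col_series.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
```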

### Boolean indexing
https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-boolean

In [None]:
above_35 = titanic[titanic["Age"] > 35]
above_35.head(10)
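
The condition inside the brackets is itself a Series of booleans. The sketch below, on a small illustrative frame with one missing age, prints that mask and shows how conditions might be combined.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({"Age": [22, 35, 58, np.nan],
                       "Sex": ["male", "male", "female", "female"]})

mask = people["Age"] > 35          # Series of True/False; NaN > 35 is False
print(mask)

# Combine conditions with & (and) and | (or); each one needs parentheses
older_women = people[(people["Age"] > 35) & (people["Sex"] == "female")]

# isin() checks membership in a list of values
selected = people[people["Age"].isin([22, 35])]
print(older_women)
print(selected)
```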

## How to select specific rows and columns from a DataFrame
![](https://pandas.pydata.org/docs/_images/03_subset_columns_rows.svg)

```python
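# General form (pseudocode, not runnable as written):
#   df.loc[row_labels_or_mask, column_labels]     # label-based
#   df.iloc[row_positions, column_positions]      # position-based
# The column indexer is optional in both cases.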
```

In [None]:
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
adult_names.head()

In [None]:
adult_names = titanic.Name[titanic.Age > 35]
adult_names.head()

In [None]:
sliced_1 = titanic.iloc[3:10, 0:2]
sliced_1

In [None]:
sliced_2 = titanic.iloc[[0, 2, 4, 6, 8], [0, 2, 4]]
sliced_2
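
The difference between `loc` and `iloc` is easiest to see when the row labels do not coincide with the row positions. The frame below is a made-up example with labels 10 to 40.

```python
import pandas as pd

# Made-up data; the index labels deliberately differ from positions 0..3
scores = pd.DataFrame({"math": [90, 80, 70, 60], "art": [65, 75, 85, 95]},
                      index=[10, 20, 30, 40])

by_label = scores.loc[10:30, "math"]     # label slice: end label 30 is included
by_position = scores.iloc[0:2, 0]        # position slice: end position 2 is excluded

print(by_label.tolist())
print(by_position.tolist())
```

Label slices include both endpoints, while position slices follow the usual Python convention of excluding the stop value.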