## 01 What is pandas?
- Series: 1-dimensional
- DataFrame: 2-dimensional (rows and columns)
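
As a minimal sketch of the two structures listed above (the values and the names `s` and `df` are made up for illustration), a Series holds one labelled column of values, while a DataFrame holds several columns that share one index:

```python
import pandas as pd

# A Series is a single labelled column (1 dimension).
s = pd.Series([14, 18, 24], name='ages')

# A DataFrame is a table of columns sharing one index (2 dimensions).
df = pd.DataFrame({'ages': [14, 18, 24], 'heights': [165, 180, 176]})

print(s.ndim)    # number of dimensions of the Series
print(df.ndim)   # number of dimensions of the DataFrame
print(df.shape)  # (rows, columns)
```

Selecting one column of a DataFrame, for example `df['ages']`, gives back a Series.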

## 02 Creating DataFrames

#### DataFrames

In [1]:
import pandas as pd

data = {
    'ages': [14, 18, 24, 42],
    'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data)
print(df)

   ages  heights
0    14      165
1    18      180
2    24      176
3    42      184


In [2]:
df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave'])
print(df.loc['Bob'])  # loc selects a row by its index label

ages        18
heights    180
Name: Bob, dtype: int64
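
The label lookup with `loc` can take more than a single label. The following sketch reuses the data from the cells above and shows three common forms; the variable names `bob`, `subset` and `amy_height` are only illustrative:

```python
import pandas as pd

data = {'ages': [14, 18, 24, 42], 'heights': [165, 180, 176, 184]}
df = pd.DataFrame(data, index=['James', 'Bob', 'Amy', 'Dave'])

# A single label returns the row as a Series.
bob = df.loc['Bob']

# A list of labels returns a DataFrame with those rows.
subset = df.loc[['James', 'Amy']]

# A row label plus a column label returns a single value.
amy_height = df.loc['Amy', 'heights']

print(bob)
print(subset)
print(amy_height)
```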


#### Practice

You are given a dictionary that contains the names and numbers of people. Create a DataFrame from the dictionary and use the name values as its index.
Then take a name from user input and output the row of the DataFrame that corresponds to that name.

In [4]:
import pandas as pd

data = {
   'name': ['James', 'Billy', 'Bob', 'Amy', 'Tom'],
   'number': ['1234', '5678', '2222', '1111', '0909']
}


df = pd.DataFrame(data, index=data['name'])
print(df.loc[input()])

James
name      James
number     1234
Name: James, dtype: object


## 03 Indexing and Slicing

#### Indexing

In [7]:
import pandas as pd

data = {
    'ages': [14, 18, 24, 42],
    'heights': [165, 180, 176, 184]
}

df = pd.DataFrame(data)

# access the 'ages' column
print(df['ages'])

# access several columns at once
print(df[['ages', 'heights']])

0    14
1    18
2    24
3    42
Name: ages, dtype: int64
   ages  heights
0    14      165
1    18      180
2    24      176
3    42      184


#### Slicing
- iloc selects rows by integer position

In [8]:
print(df.iloc[2])    # row at position 2
print(df.iloc[:3])   # rows 0-2 (position 3 is excluded)
print(df.iloc[1:3])  # rows 1-2
print(df.iloc[-2:])  # last 2 rows

ages        24
heights    176
Name: 2, dtype: int64
   ages  heights
0    14      165
1    18      180
2    24      176
   ages  heights
1    18      180
2    24      176
   ages  heights
2    24      176
3    42      184
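
The difference between positional and label-based slicing is easiest to see on a DataFrame with a non-numeric index. This is a small sketch that assumes the name index from section 02; `by_position` and `by_label` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    {'ages': [14, 18, 24, 42], 'heights': [165, 180, 176, 184]},
    index=['James', 'Bob', 'Amy', 'Dave'],
)

# iloc slices by position; the end position is excluded.
by_position = df.iloc[1:3]

# loc slices by label; the end label is included.
by_label = df.loc['Bob':'Amy']

print(by_position)
print(by_label)
```

Both slices select the rows for Bob and Amy, even though the `iloc` slice names position 3 as its end and the `loc` slice names the last row it wants.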


#### Conditions

In [10]:
# condition: age > 18 and height > 180
df[(df['ages']>18) & (df['heights']>180)]

# condition: age > 18 or height > 180 (only this last expression is displayed)
df[(df['ages']>18) | (df['heights']>180)]

   ages  heights
2    24      176
3    42      184
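
Each comparison in the cell above produces a boolean Series (a mask), and the masks are combined element-wise with `&` and `|`. The sketch below names the masks to make this visible. The names `older`, `taller`, `both`, `either` and `neither` are illustrative and not part of the course material:

```python
import pandas as pd

df = pd.DataFrame({'ages': [14, 18, 24, 42], 'heights': [165, 180, 176, 184]})

# Each comparison produces a boolean Series, one value per row.
older = df['ages'] > 18
taller = df['heights'] > 180

both = df[older & taller]        # rows where both conditions hold
either = df[older | taller]      # rows where at least one condition holds
neither = df[~(older | taller)]  # ~ negates a mask

print(both)
print(either)
print(neither)
```

When the comparisons are written inline, each one needs its own parentheses because `&` and `|` bind more tightly than `>`.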


#### Practice

You are given a DataFrame that includes the names and ranks of people.
You need to take a rank as input and output the corresponding name column from the DataFrame as a Series. 

In [11]:
import pandas as pd

data = {
   'name': ['James', 'Billy', 'Bob', 'Amy', 'Tom', 'Harry'],
   'rank': [4, 1, 3, 5, 2, 6]
}

df = pd.DataFrame(data, index=data['name'])

user_rank = int(input())
print(df.loc[df['rank'] == user_rank, 'name'])

4
James    James
Name: name, dtype: object


## 04 Reading Data

#### head() and tail()

In [15]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")

print(df.head())   # head() shows the first 5 rows by default
print(df.tail())   # tail() shows the last 5 rows by default
print(df.head(7))  # first 7 rows

       date       state  cases  deaths
0  25.01.20  California      1       0
1  26.01.20  California      1       0
2  27.01.20  California      0       0
3  28.01.20  California      0       0
4  29.01.20  California      0       0
         date       state  cases  deaths
337  27.12.20  California  37555      62
338  28.12.20  California  41720     246
339  29.12.20  California  34166     425
340  30.12.20  California  32386     437
341  31.12.20  California  32264     574
       date       state  cases  deaths
0  25.01.20  California      1       0
1  26.01.20  California      1       0
2  27.01.20  California      0       0
3  28.01.20  California      0       0
4  29.01.20  California      0       0
5  30.01.20  California      0       0
6  31.01.20  California      1       0


#### info() and set_index()

In [17]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 342 entries, 0 to 341
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    342 non-null    object
 1   state   342 non-null    object
 2   cases   342 non-null    int64 
 3   deaths  342 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 10.8+ KB
None


In [22]:
# set the index: the rows are then labelled by (here) the date
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")
df.set_index("date", inplace=True)

print(df.head())

               state  cases  deaths
date                               
25.01.20  California      1       0
26.01.20  California      1       0
27.01.20  California      0       0
28.01.20  California      0       0
29.01.20  California      0       0


#### Dropping a Column
- drop() deletes rows and columns.
- axis=1 specifies that we want to drop a column.
- axis=0 will drop a row. 
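
A minimal sketch of both directions on a two-row stand-in for the COVID data (the inline values and the names `no_state` and `no_first_day` are illustrative). It omits `inplace=True`, so `drop()` returns a new DataFrame instead of modifying `df`:

```python
import pandas as pd

df = pd.DataFrame(
    {'state': ['California', 'California'], 'cases': [1, 0], 'deaths': [0, 0]},
    index=['25.01.20', '26.01.20'],
)

# Without inplace=True, drop() returns a new DataFrame and leaves df unchanged.
no_state = df.drop('state', axis=1)         # drop a column
no_first_day = df.drop('25.01.20', axis=0)  # drop a row by its index label

print(no_state.columns.tolist())
print(no_first_day.index.tolist())
print(df.shape)  # the original still has both rows and all three columns
```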

In [25]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")
df.set_index('date', inplace=True)
df.drop('state', axis=1, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 342 entries, 25.01.20 to 31.12.20
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   cases   342 non-null    int64
 1   deaths  342 non-null    int64
dtypes: int64(2)
memory usage: 8.0+ KB


#### Practice

You are working with the 'ca-covid' CSV file that contains the COVID-19 infection data in California for the year 2020.
The file provides data on daily cases and deaths for the entire year.
Find and output the row that corresponds to December 31, 2020. 

In [54]:
import pandas as pd

#df = pd.read_csv("/usercode/files/ca-covid.csv")
df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")
df.set_index('date', inplace=True)
df.drop('state', axis=1, inplace=True)

print(df.iloc[-1])

cases     32264
deaths      574
Name: 31.12.20, dtype: int64
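
`iloc[-1]` works here because December 31 happens to be the last row. Since the date is the index, the row can also be looked up by its label. The sketch below assumes a three-row inline CSV (values copied from the `tail()` output in this section) in place of the downloaded file:

```python
import io
import pandas as pd

# A few rows copied from the tail() output, standing in for the full CSV file.
csv_text = """date,state,cases,deaths
29.12.20,California,34166,425
30.12.20,California,32386,437
31.12.20,California,32264,574
"""

df = pd.read_csv(io.StringIO(csv_text))
df.set_index('date', inplace=True)
df.drop('state', axis=1, inplace=True)

# Look the row up by its index label instead of by position.
row = df.loc['31.12.20']
print(row)
```

The label lookup keeps working if the file is later extended with more rows.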


## 05 Working with Data

#### Creating Columns

In [55]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")

df.drop('state', axis=1, inplace=True)

df['month'] = pd.to_datetime(df['date'], format="%d.%m.%y").dt.month_name()

df.set_index('date', inplace=True)

print(df.head())

# result: a custom column is created (the month is derived from the date)
# the date is converted to a month name and stored in its own column

          cases  deaths    month
date                            
25.01.20      1       0  January
26.01.20      1       0  January
27.01.20      0       0  January
28.01.20      0       0  January
29.01.20      0       0  January
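
The same pattern works for other date parts. This sketch parses three dates taken from the data set and derives several columns through the `.dt` accessor. The column names `month_number` and `weekday` are chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'date': ['25.01.20', '31.01.20', '31.12.20'],
                   'cases': [1, 1, 32264]})

# Parse the day.month.year strings into real datetime values.
parsed = pd.to_datetime(df['date'], format='%d.%m.%y')

# The .dt accessor exposes date parts; each result can become a new column.
df['month'] = parsed.dt.month_name()
df['month_number'] = parsed.dt.month
df['weekday'] = parsed.dt.day_name()

print(df)
```

Passing `format` explicitly tells pandas that the day comes first, so a string such as `01.02.20` is read as February 1 and not as January 2.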


#### Summary Statistics

In [4]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")

df.drop('state', axis=1, inplace=True)
df['month'] = pd.to_datetime(df['date'], format="%d.%m.%y").dt.month_name()
df.set_index('date', inplace=True)

print(df.describe())  # describe() returns summary statistics for all numeric columns
print('')
print(df['cases'].describe())  # summary statistics for a single selected column

              cases      deaths
count    342.000000  342.000000
mean    6747.862573   75.921053
std    10023.201267   76.639861
min        0.000000   -5.000000
25%     1352.250000   22.000000
50%     3462.500000   62.500000
75%     7637.250000  104.000000
max    64987.000000  574.000000

count      342.000000
mean      6747.862573
std      10023.201267
min          0.000000
25%       1352.250000
50%       3462.500000
75%       7637.250000
max      64987.000000
Name: cases, dtype: float64


#### Practice

In [16]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")

df.drop('state', axis=1, inplace=True)
df['date'] = pd.to_datetime(df['date'], format="%d.%m.%y")

df['weekday'] = df['date'].dt.strftime('%A')  # strftime('%A') formats each date as its weekday name
print(df[-7:])

          date  cases  deaths    weekday
335 2020-12-25  16772      20     Friday
336 2020-12-26  64987     257   Saturday
337 2020-12-27  37555      62     Sunday
338 2020-12-28  41720     246     Monday
339 2020-12-29  34166     425    Tuesday
340 2020-12-30  32386     437  Wednesday
341 2020-12-31  32264     574   Thursday


#### Example

In [21]:
import pandas as pd

data = {
    'height': [133, 120, 180, 100],
    'age': [9, 7, 16, 4]
}

df = pd.DataFrame(data)
print(df['age'].mean())

9.0
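
Other single statistics are available in the same way as `mean()`. A short sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'height': [133, 120, 180, 100], 'age': [9, 7, 16, 4]})

print(df['age'].mean())    # arithmetic mean
print(df['age'].median())  # middle value
print(df['height'].max())  # largest value
print(df['height'].min())  # smallest value
print(df.mean())           # column-wise means as a Series
```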


## 06 Grouping

#### Counting values: .value_counts()

In [23]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")

df.drop('state', axis=1, inplace=True)
df['month'] = pd.to_datetime(df['date'], format="%d.%m.%y").dt.month_name()
df.set_index('date', inplace=True)

print(df['month'].value_counts())  # number of rows (days) recorded for each month
# .value_counts(): how often each value occurs in the data set

March        31
May          31
July         31
August       31
October      31
December     31
April        30
June         30
September    30
November     30
February     29
January       7
Name: month, dtype: int64
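
A small sketch of `value_counts()` on a made-up Series (the six values are illustrative). By default the result is sorted from the most to the least frequent value, and `normalize=True` returns shares instead of counts:

```python
import pandas as pd

months = pd.Series(['January', 'February', 'February', 'March', 'March', 'March'])

counts = months.value_counts()                # absolute counts, largest first
shares = months.value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares)
```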


#### Aggregating values

In [77]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")

df.drop('state', axis=1, inplace=True)
df['month'] = pd.to_datetime(df['date'], format="%d.%m.%y").dt.month_name()
df.set_index('date', inplace=True)

print(df.groupby('month')['cases'].sum())  # cases per month; .groupby() groups the data by the given column
print('')
print('Total COVID cases in the year:', df['cases'].sum(), '\n')  # cases in the year
print('Highest daily number of COVID cases:', df['cases'].max(), '\n')
print('Lowest daily number of COVID cases:', df['cases'].min(), '\n')
print('Average daily number of COVID cases:', df['cases'].mean())

month
April          41887
August        210268
December     1070577
February          25
January            3
July          270120
June          119039
March           8555
May            62644
November      301944
October       114123
September     108584
Name: cases, dtype: int64

Total COVID cases in the year: 2307769 

Highest daily number of COVID cases: 64987 

Lowest daily number of COVID cases: 0 

Average daily number of COVID cases: 6747.862573099415
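
The grouped output above lists the months alphabetically because `groupby()` sorts the group keys by default. The sketch below uses a small made-up data set to show several aggregations at once with `agg()`, and `sort=False` to keep the groups in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame({
    'month': ['January', 'January', 'February', 'February', 'March'],
    'cases': [1, 0, 3, 5, 10],
})

# Several aggregations per group at once.
summary = df.groupby('month')['cases'].agg(['sum', 'max', 'mean'])
print(summary)

# sort=False keeps groups in order of first appearance instead of
# sorting the group keys alphabetically.
in_order = df.groupby('month', sort=False)['cases'].sum()
print(in_order)
```

For data that is already in chronological order, such as the COVID file, `sort=False` yields the months in calendar order.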


#### Practice

In [76]:
import pandas as pd

df = pd.read_csv("https://www.sololearn.com/uploads/ca-covid.csv")

df.drop('state', axis=1, inplace=True)
df['date'] = pd.to_datetime(df['date'], format="%d.%m.%y")
df['month'] = df['date'].dt.month_name()
df.set_index('date', inplace=True)

month_user = input()
maxim = df[df['month']==month_user]['cases'].max()
print(df[df['cases']==maxim])

April
            cases  deaths  month
date                            
2020-04-29   2334      77  April
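
One caveat with comparing against the maximum: `df['cases']==maxim` searches the whole DataFrame, so a day in another month with the same case count would be printed too. A possible variation, sketched here on made-up data with a fixed `month_user` in place of `input()`, filters the month first and then uses `idxmax()`:

```python
import pandas as pd

df = pd.DataFrame(
    {'cases': [5, 9, 9, 2], 'month': ['March', 'March', 'April', 'April']},
    index=pd.to_datetime(['2020-03-30', '2020-03-31', '2020-04-01', '2020-04-02']),
)

month_user = 'April'  # stands in for input()

# Restrict to the chosen month first, then find the label of the largest value.
in_month = df[df['month'] == month_user]
peak_day = in_month['cases'].idxmax()

print(in_month.loc[[peak_day]])  # double brackets keep the result a DataFrame
```

In this example March 31 and April 1 both have 9 cases, but only the April row is returned.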


In [79]:
# example: find the maximum age for each name:
# df.groupby('name')['age'].max()
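
A runnable version of the commented example, assuming a hypothetical DataFrame in which names repeat with different ages:

```python
import pandas as pd

# Hypothetical data: names may repeat, each with a different age.
df = pd.DataFrame({
    'name': ['Amy', 'Bob', 'Amy', 'Bob', 'Tom'],
    'age': [24, 18, 31, 42, 14],
})

# One row per name, holding the largest age seen for that name.
max_age = df.groupby('name')['age'].max()
print(max_age)
```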