![title](./pic/selecting/built_in/1_title.png)

In [2]:
import pandas as pd
import random

In [3]:
df = pd.read_csv('./csv/titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


<br>

`Pandas` ships with several **built-in conditional selectors**; this notebook highlights the most important ones. These selectors help you get a quick overview of your `DataFrame` and quickly find entries with particular values.

---

## `.copy()`

The pandas method `.copy()` creates a **copy of a pandas object**. Assigning an object to a second variable does not copy it: variables are only references to an object, so any change made through the new variable also changes the original data. `.copy()`, by contrast, creates a new object in memory.

![title](./pic/aditional_functions/copy/2_copy.png)

### Without `.copy()`, using plain variable assignment

In [4]:
df2 = df

In [5]:
df2.loc[0, 'Name'] = 'New Name'

In [6]:
df2.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,New Name,male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [7]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,New Name,male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


This shows clearly that changes to `df2` directly affect the original `df`, because `df2` is only a reference to `df` and not an independent object.

### With `.copy()`

In [8]:
df3 = df.copy()

In [9]:
df3.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,New Name,male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [10]:
df3.loc[1, 'Name'] = 'New Name Again'

In [11]:
df3.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,New Name,male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,New Name Again,female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [12]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,New Name,male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


The change made through `df2` is visible again. The change made to `df3`, however, exists only in `df3` and was not carried over to the original `df`.
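
The same difference can be checked directly, without comparing table output. This is a minimal sketch on a small hand-built `DataFrame`; the names `original`, `alias`, and `clone` and the sample values are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Small stand-in for the Titanic data (hypothetical values)
original = pd.DataFrame({'Name': ['Kelly', 'Wilkes'], 'Age': [34.5, 47.0]})

alias = original          # second name for the same object
clone = original.copy()   # independent object with its own data

print(alias is original)  # True: both names refer to one DataFrame
print(clone is original)  # False: .copy() created a new DataFrame

# Changing the clone leaves the original untouched
clone.loc[0, 'Name'] = 'New Name'
print(original.loc[0, 'Name'])  # Kelly
```

The `is` operator compares object identity, which makes it a quick way to find out whether two variables share one `DataFrame`.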

---

## `.isin()`

The first selector is `.isin()`. With `.isin()` you can select rows whose value is **"in"** a list of values.

![title](./pic/selecting/built_in/2_isin.png)

For example, we can use it to select only passengers whose sex is `female`:

In [13]:
df[df.Sex.isin(['female'])]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
6,898,1,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
8,900,1,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
12,904,1,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
...,...,...,...,...,...,...,...,...,...,...,...,...
409,1301,1,3,"Peacock, Miss. Treasteall",female,3.0,1,1,SOTON/O.Q. 3101315,13.7750,,S
410,1302,1,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.7500,,Q
411,1303,1,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q
412,1304,1,3,"Henriksson, Miss. Jenny Lovisa",female,28.0,0,0,347086,7.7750,,S


...or all passengers travelling in first class:

In [14]:
df[df.Pclass.isin([1])]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
11,903,0,1,"Jones, Mr. Charles Cresson",male,46.0,0,0,694,26.0000,,S
12,904,1,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
14,906,1,1,"Chaffee, Mrs. Herbert Fuller (Carrie Constance...",female,47.0,1,0,W.E.P. 5734,61.1750,E31,S
20,912,0,1,"Rothschild, Mr. Martin",male,55.0,1,0,PC 17603,59.4000,,C
22,914,1,1,"Flegenheim, Mrs. Alfred (Antoinette)",female,,0,0,PC 17598,31.6833,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
403,1295,0,1,"Carrau, Mr. Jose Pedro",male,17.0,0,0,113059,47.1000,,S
404,1296,0,1,"Frauenthal, Mr. Isaac Gerald",male,43.0,1,0,17765,27.7208,D40,C
407,1299,0,1,"Widener, Mr. George Dunton",male,50.0,1,1,113503,211.5000,C80,C
411,1303,1,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q
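
`.isin()` becomes more useful once the list holds several values, and the resulting mask can be inverted with `~`. The following sketch assumes a tiny made-up `DataFrame` named `passengers` that only borrows the column names of the Titanic file:

```python
import pandas as pd

# Hypothetical mini data set with the same column names as the Titanic file
passengers = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Pclass': [1, 2, 3, 1],
    'Embarked': ['S', 'C', 'Q', 'C'],
})

# Rows whose Pclass is 1 or 2
upper_classes = passengers[passengers.Pclass.isin([1, 2])]

# ~ inverts the mask: rows that did NOT embark in 'S' or 'Q'
not_s_or_q = passengers[~passengers.Embarked.isin(['S', 'Q'])]

print(upper_classes)
print(not_s_or_q)
```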


However, `.isin()` always compares against the complete value. Searching for just `"f"` to find female passengers does not work...

In [15]:
df[df.Sex.isin(['f'])]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked


The following method helps here.

---

## `.contains()`

Unlike `.isin()`, `.contains()` does not need the complete value. If, for example, we only know the first name of the person we are looking for, we can use `.contains()`, which is called through the `.str` accessor of a column. `Pandas` then finds every entry in that column that contains this `String` (or even a single `Char`) at any position. By default the search term is interpreted as a regular expression.

![title](./pic/selecting/built_in/6_contains.png)

In [16]:
df.Sex.str.contains('f')

0      False
1       True
2      False
3      False
4       True
       ...  
413    False
414     True
415    False
416    False
417    False
Name: Sex, Length: 418, dtype: bool

In [17]:
df[df.Sex.str.contains('f')]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
6,898,1,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
8,900,1,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
12,904,1,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
...,...,...,...,...,...,...,...,...,...,...,...,...
409,1301,1,3,"Peacock, Miss. Treasteall",female,3.0,1,1,SOTON/O.Q. 3101315,13.7750,,S
410,1302,1,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.7500,,Q
411,1303,1,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q
412,1304,1,3,"Henriksson, Miss. Jenny Lovisa",female,28.0,0,0,347086,7.7750,,S


In [18]:
df.Name.str.contains('an', case=False)

0      False
1      False
2       True
3      False
4       True
       ...  
413    False
414     True
415    False
416    False
417    False
Name: Name, Length: 418, dtype: bool

In [19]:
df[df.Name.str.contains('an', case=False)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,0,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.2250,,S
7,899,0,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0000,,S
14,906,1,1,"Chaffee, Mrs. Herbert Fuller (Carrie Constance...",female,47.0,1,0,W.E.P. 5734,61.1750,E31,S
...,...,...,...,...,...,...,...,...,...,...,...,...
397,1289,1,1,"Frolicher-Stehli, Mrs. Maxmillian (Margaretha ...",female,48.0,1,1,13567,79.2000,B41,C
408,1300,1,3,"Riordan, Miss. Johanna Hannah""""",female,,0,0,334915,7.7208,,Q
410,1302,1,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.7500,,Q
411,1303,1,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q


In [22]:
# Raises a ValueError: Cabin contains NaN, so the mask is not purely boolean
# df[df.Cabin.str.contains('B', case=False)]

ValueError: Cannot mask with non-boolean array containing NA / NaN values

In [23]:
df[df.Cabin.str.contains('B', case=False, na=False)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,904,1,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
24,916,1,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C
26,918,1,1,"Ostby, Miss. Helene Ragnhild",female,22.0,0,1,113509,61.9792,B36,C
59,951,1,1,"Chaudanson, Miss. Victorine",female,36.0,0,0,PC 17608,262.375,B61,C
64,956,0,1,"Ryerson, Master. John Borie",male,13.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C
92,984,1,1,"Davidson, Mrs. Thornton (Orian Hays)",female,27.0,1,2,F.C. 12750,52.0,B71,S
142,1034,0,1,"Ryerson, Mr. Arthur Larned",male,61.0,1,3,PC 17608,262.375,B57 B59 B63 B66,C
166,1058,0,1,"Brandeis, Mr. Emil",male,48.0,0,0,PC 17591,50.4958,B10,C
184,1076,1,1,"Douglas, Mrs. Frederick Charles (Mary Helene B...",female,27.0,1,1,PC 17558,247.5208,B58 B60,C
215,1107,0,1,"Head, Mr. Christopher",male,42.0,0,0,113038,42.5,B11,S
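
Two parameters of `.str.contains()` are worth a closer look: `na=` decides what happens to missing values, and `regex=` decides whether the search term is a regular expression or literal text. The sketch below uses two small made-up Series (`cabins` and `tickets` are illustrative names and values):

```python
import pandas as pd

cabins = pd.Series(['B45', None, 'C78', 'b10'], name='Cabin')
tickets = pd.Series(['PC 17608', 'W./C. 6607', '330911'], name='Ticket')

# Missing values become False instead of NaN, so the result is a valid mask
mask_b = cabins.str.contains('B', case=False, na=False)

# regex=False searches for the literal text '.', not "any character"
mask_dot = tickets.str.contains('.', regex=False)

print(mask_b.tolist())    # [True, False, False, True]
print(mask_dot.tolist())  # [False, True, False]
```

Without `regex=False`, the pattern `'.'` would match every non-empty string, because a dot stands for any character in a regular expression.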


---

## `.notnull()`

The next selector is `.notnull()`. It lets you pick out values that are not empty (not `NaN`).

![title](./pic/selecting/built_in/3_notnull.png)

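You can apply `.notnull()` to the **entire** `DataFrame`. The result is a boolean `DataFrame` of the same shape that is `True` wherever a value is present. Using it as a mask, as in `df[df.notnull()]`, keeps every row and simply leaves `NaN` in the empty cells, so it does not filter out rows with missing values...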

In [44]:
df.notnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,True,True,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,False,True,True,True,True,False,True
2,True,True,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,False,True,True,True,True,False,True
4,True,True,True,True,True,True,True,True,True,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...
418,True,True,True,True,True,True,True,True,True,True,False,True
419,True,True,True,True,True,True,True,True,True,True,False,True
420,True,True,True,True,True,False,True,True,True,True,False,True
421,True,True,True,True,True,True,True,True,True,True,False,True


In [8]:
df[df.notnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


...or you can apply it to **individual columns**, which keeps only the rows that have a value in that column. Other columns of those rows may still contain `NaN` values.

In [9]:
df.loc[df.Cabin.notnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,904,1,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
14,906,1,1,"Chaffee, Mrs. Herbert Fuller (Carrie Constance...",female,47.0,1,0,W.E.P. 5734,61.1750,E31,S
24,916,1,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48.0,1,3,PC 17608,262.3750,B57 B59 B63 B66,C
26,918,1,1,"Ostby, Miss. Helene Ragnhild",female,22.0,0,1,113509,61.9792,B36,C
28,920,0,1,"Brady, Mr. John Bertram",male,41.0,0,0,113054,30.5000,A21,S
...,...,...,...,...,...,...,...,...,...,...,...,...
404,1296,0,1,"Frauenthal, Mr. Isaac Gerald",male,43.0,1,0,17765,27.7208,D40,C
405,1297,0,2,"Nourney, Mr. Alfred (Baron von Drachstedt"")""",male,20.0,0,0,SC/PARIS 2166,13.8625,D38,C
407,1299,0,1,"Widener, Mr. George Dunton",male,50.0,1,1,113503,211.5000,C80,C
411,1303,1,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q
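
To keep only rows without any missing value, the whole-frame mask has to be reduced to one boolean per row. This is a minimal sketch, assuming a small made-up `DataFrame` called `people`; the comparison with `.dropna()` is added here for illustration and is not part of the notebook:

```python
import pandas as pd
import numpy as np

people = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Age': [34.5, np.nan, 22.0],
    'Cabin': ['B45', 'C78', None],
})

# Indexing with the full boolean frame keeps every row
same_shape = people[people.notnull()]

# .all(axis=1) is True only for rows without any missing value
complete_rows = people[people.notnull().all(axis=1)]

print(same_shape.shape)  # (3, 3)
print(complete_rows)
print(complete_rows.equals(people.dropna()))  # True
```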


---

## `.isnull()` / `.isna()`

The two methods `.isnull()` and `.isna()` return a boolean mask that is `True` wherever your `DataFrame` contains a `NaN` value. With this mask you can pick out the entries that are empty (`NaN`).

![title](./pic/selecting/built_in/4_isnull.png)

To filter out passengers whose age is not given, for example, we would proceed as follows:

First, the whole `DataFrame` can be queried. This gives a (not very readable) overview of all rows and columns, showing for each cell whether it holds a `NaN` value or not:

In [45]:
df.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,True,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
418,False,False,False,False,False,False,False,False,False,False,True,False
419,False,False,False,False,False,False,False,False,False,False,True,False
420,False,False,False,False,False,True,False,False,False,False,True,False
421,False,False,False,False,False,False,False,False,False,False,True,False


After that, you can look at the specific column you are interested in:

In [24]:
df.Age.isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
413     True
414    False
415    False
416     True
417     True
Name: Age, Length: 418, dtype: bool

We can put this query inside `[]` to get the corresponding rows of the `DataFrame`:

In [27]:
df_age_na = df[df.Age.isnull()]
df_age_na

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
10,902,0,3,"Ilieff, Mr. Ylio",male,,0,0,349220,7.8958,,S
22,914,1,1,"Flegenheim, Mrs. Alfred (Antoinette)",female,,0,0,PC 17598,31.6833,,S
29,921,0,3,"Samaan, Mr. Elias",male,,2,0,2662,21.6792,,C
33,925,1,3,"Johnston, Mrs. Andrew G (Elizabeth Lily"" Watson)""",female,,1,2,W./C. 6607,23.4500,,S
36,928,1,3,"Roth, Miss. Sarah A",female,,0,0,342712,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
408,1300,1,3,"Riordan, Miss. Johanna Hannah""""",female,,0,0,334915,7.7208,,Q
410,1302,1,3,"Naughton, Miss. Hannah",female,,0,0,365237,7.7500,,Q
413,1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
416,1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


If you are specifically interested in the index labels of the rows that contain a `NaN` value, append the `.index` attribute to the filtered `DataFrame`. It returns all index labels contained in that `DataFrame`:

In [28]:
df_age_na.index

Int64Index([ 10,  22,  29,  33,  36,  39,  41,  47,  54,  58,  65,  76,  83,
             84,  85,  88,  91,  93, 102, 107, 108, 111, 116, 121, 124, 127,
            132, 133, 146, 148, 151, 160, 163, 168, 170, 173, 183, 188, 191,
            199, 200, 205, 211, 216, 219, 225, 227, 233, 243, 244, 249, 255,
            256, 265, 266, 267, 268, 271, 273, 274, 282, 286, 288, 289, 290,
            292, 297, 301, 304, 312, 332, 339, 342, 344, 357, 358, 365, 366,
            380, 382, 384, 408, 410, 413, 416, 417],
           dtype='int64')

In [29]:
df_age_na.index[0]

10

In [30]:
df_age_na.index[1]

22
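
The intermediate `DataFrame` is not strictly needed to get these labels. As a small sketch on made-up data (`ages` and `missing_labels` are illustrative names), the boolean mask can be applied to the index itself:

```python
import pandas as pd
import numpy as np

ages = pd.DataFrame({'Age': [34.5, np.nan, 62.0, np.nan]})

# Boolean indexing works on the index as well as on the rows
missing_labels = ages.index[ages.Age.isnull()]

print(missing_labels.tolist())  # [1, 3]
```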

***

In [12]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [13]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

It makes no difference which of the two functions you use. While researching, I came across the pandas source code ([Official Pandas Code on GitHub](https://github.com/pandas-dev/pandas/blob/0409521665bd436a10aea7e06336066bf07ff057/pandas/core/dtypes/missing.py#L109)), which makes clear that `isnull` is just an alias of `isna`. You can therefore choose freely between the two.

![title](./pic/selecting/built_in/7_isna_isnull.png)
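
The equivalence can also be confirmed on data. The following sketch assumes a small made-up `DataFrame` named `sample` and shows why `.sum()` counts missing values per column:

```python
import pandas as pd
import numpy as np

sample = pd.DataFrame({
    'Age': [34.5, np.nan, 62.0],
    'Cabin': [None, None, 'B45'],
})

# Both methods produce the same boolean mask
print(sample.isna().equals(sample.isnull()))  # True

# True counts as 1, so .sum() yields the number of missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column.to_dict())  # {'Age': 1, 'Cabin': 2}
```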

---

## `.duplicated()`

This method does exactly what its name suggests: it marks duplicate rows. If a passenger was accidentally entered twice, the duplicate entry can distort the statistics. You want to detect and eliminate such duplicates right at the start of any analysis.

### Preparation: inserting duplicates

In [14]:
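for i in range(5):
    # randint includes both endpoints, so the upper bound must be len(df) - 1
    random_index = random.randint(0, len(df) - 1)
    print(random_index, df['PassengerId'].iloc[random_index])

    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    df = pd.concat([df, df.iloc[[random_index]]])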

367 1259
42 934
299 1191
255 1147
117 1009


In [15]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
367,1259,1,3,"Riihivouri, Miss. Susanna Juhantytar Sanni""""",female,22.0,0,0,3101295,39.6875,,S
42,934,0,3,"Goldsmith, Mr. Nathan",male,41.0,0,0,SOTON/O.Q. 3101263,7.8500,,S
299,1191,0,3,"Johansson, Mr. Nils",male,29.0,0,0,347467,7.8542,,S
255,1147,0,3,"MacKay, Mr. George William",male,,0,0,C.A. 42795,7.5500,,S


In [16]:
df = df.sample(frac=1).reset_index(drop=True)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,918,1,1,"Ostby, Miss. Helene Ragnhild",female,22.0,0,1,113509,61.9792,B36,C
1,1024,1,3,"Lefebre, Mrs. Frank (Frances)",female,,0,4,4133,25.4667,,S
2,956,0,1,"Ryerson, Master. John Borie",male,13.0,2,2,PC 17608,262.3750,B57 B59 B63 B66,C
3,1136,0,3,"Johnston, Master. William Arthur Willie""""",male,,1,2,W./C. 6607,23.4500,,S
4,1252,0,3,"Sage, Master. William Henry",male,14.5,8,2,CA. 2343,69.5500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
418,1172,1,3,"Oreskovic, Miss. Jelka",female,23.0,0,0,315085,8.6625,,S
419,1095,1,2,"Quick, Miss. Winifred Vera",female,8.0,1,1,26360,26.0000,,S
420,1274,1,3,"Risien, Mrs. Samuel (Emma)",female,,0,0,364498,14.5000,,S
421,1161,0,3,"Pokrnic, Mr. Mate",male,17.0,0,0,315095,8.6625,,S


![title](./pic/selecting/built_in/5_duplicated.png)

In [17]:
df_duplicates = df[df.duplicated()]
df_duplicates

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
169,1259,1,3,"Riihivouri, Miss. Susanna Juhantytar Sanni""""",female,22.0,0,0,3101295,39.6875,,S
184,1191,0,3,"Johansson, Mr. Nils",male,29.0,0,0,347467,7.8542,,S
185,1147,0,3,"MacKay, Mr. George William",male,,0,0,C.A. 42795,7.55,,S
237,1009,1,3,"Sandstrom, Miss. Beatrice Irene",female,1.0,1,1,PP 9549,16.7,G6,S
347,934,0,3,"Goldsmith, Mr. Nathan",male,41.0,0,0,SOTON/O.Q. 3101263,7.85,,S


In [18]:
df.loc[134, 'Name'] = 'Neuer Name'
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,918,1,1,"Ostby, Miss. Helene Ragnhild",female,22.0,0,1,113509,61.9792,B36,C
1,1024,1,3,"Lefebre, Mrs. Frank (Frances)",female,,0,4,4133,25.4667,,S
2,956,0,1,"Ryerson, Master. John Borie",male,13.0,2,2,PC 17608,262.3750,B57 B59 B63 B66,C
3,1136,0,3,"Johnston, Master. William Arthur Willie""""",male,,1,2,W./C. 6607,23.4500,,S
4,1252,0,3,"Sage, Master. William Henry",male,14.5,8,2,CA. 2343,69.5500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
418,1172,1,3,"Oreskovic, Miss. Jelka",female,23.0,0,0,315085,8.6625,,S
419,1095,1,2,"Quick, Miss. Winifred Vera",female,8.0,1,1,26360,26.0000,,S
420,1274,1,3,"Risien, Mrs. Samuel (Emma)",female,,0,0,364498,14.5000,,S
421,1161,0,3,"Pokrnic, Mr. Mate",male,17.0,0,0,315095,8.6625,,S


In [19]:
df_duplicates = df[df.duplicated()]
df_duplicates

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
169,1259,1,3,"Riihivouri, Miss. Susanna Juhantytar Sanni""""",female,22.0,0,0,3101295,39.6875,,S
184,1191,0,3,"Johansson, Mr. Nils",male,29.0,0,0,347467,7.8542,,S
185,1147,0,3,"MacKay, Mr. George William",male,,0,0,C.A. 42795,7.55,,S
237,1009,1,3,"Sandstrom, Miss. Beatrice Irene",female,1.0,1,1,PP 9549,16.7,G6,S
347,934,0,3,"Goldsmith, Mr. Nathan",male,41.0,0,0,SOTON/O.Q. 3101263,7.85,,S


<br>

The `subset=` parameter restricts the duplicate search to one or more columns. Instead of comparing entire rows with each other, only the values in the given columns are compared. This allows a more refined and more precise search for duplicates.

In [20]:
df_duplicates_col = df[df.duplicated(subset='PassengerId')]
df_duplicates_col

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
169,1259,1,3,"Riihivouri, Miss. Susanna Juhantytar Sanni""""",female,22.0,0,0,3101295,39.6875,,S
184,1191,0,3,"Johansson, Mr. Nils",male,29.0,0,0,347467,7.8542,,S
185,1147,0,3,"MacKay, Mr. George William",male,,0,0,C.A. 42795,7.55,,S
237,1009,1,3,"Sandstrom, Miss. Beatrice Irene",female,1.0,1,1,PP 9549,16.7,G6,S
347,934,0,3,"Goldsmith, Mr. Nathan",male,41.0,0,0,SOTON/O.Q. 3101263,7.85,,S
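
The effect of `subset=` is easiest to see when two rows share an ID but differ elsewhere. The sketch below uses a tiny made-up `DataFrame` (`tickets` and its values are illustrative) and also shows the `keep=` parameter, which the notebook does not cover:

```python
import pandas as pd

tickets = pd.DataFrame({
    'PassengerId': [892, 893, 892, 894],
    'Name': ['Kelly', 'Wilkes', 'Kelly (typo)', 'Myles'],
})

# Default keep='first': only the later occurrence is flagged
later_only = tickets.duplicated(subset='PassengerId')

# keep=False flags every row that has a duplicate partner
all_copies = tickets.duplicated(subset='PassengerId', keep=False)

# Comparing whole rows finds nothing here, because the names differ
whole_row = tickets.duplicated()

print(later_only.tolist())  # [False, False, True, False]
print(all_copies.tolist())  # [True, False, True, False]
print(whole_row.tolist())   # [False, False, False, False]
```

A whole-row comparison misses the repeated `PassengerId` as soon as any other column differs, which is the case `subset=` is meant for.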


How to remove these duplicates is covered later, since it does not fit the topic of this notebook.