# Getting started with pandas

[pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) is a Python library for data processing and analysis. It is built on top of NumPy, a lower-level library, and provides dedicated data structures and operations for manipulating numeric tables and time series. The name comes from the econometrics term "panel data", which describes multidimensional structured data sets.

The library is optimized for performance: its most critical parts are written in C and Cython.

In [1]:
import pandas as pd

pandas introduces several new data types.

**Series** is an indexed one-dimensional array of values.

In [2]:
my_series = pd.Series([1, 2, 3, 4, 5, 6])
my_series

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

We no longer need loops to operate on the elements. Vectorized operations apply to every element of the array at once and run much faster.

In [3]:
my_series ** 2

0     1
1     4
2     9
3    16
4    25
5    36
dtype: int64
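To make the contrast with loops concrete, here is a small sketch. The names `values`, `squares_loop` and `squares_vectorized` are illustrative and do not come from the notebook.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6])

# Explicit Python loop: one element at a time
squares_loop = pd.Series([v ** 2 for v in values])

# Vectorized: the operation is applied to the whole Series at once
squares_vectorized = values ** 2

# Both approaches produce the same numbers
print(squares_loop.tolist())
print(squares_vectorized.tolist())
```

On a six-element Series the speed difference is negligible; it becomes noticeable as the data grows.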

**DataFrame** is an indexed two-dimensional table of values whose columns may hold different types.
Two common ways to create a DataFrame are from a dictionary (with the `pd.DataFrame()` constructor) or by importing data from an external source such as a file.

Each column of a DataFrame is a Series.

In [4]:
# create a DataFrame from a dictionary

my_df = pd.DataFrame({
    'City': ['MSK', 'SPB', 'EKB'],
    'Temp': [-3, -5, -2],
    'State': ['Cloudy', 'Clear', 'Snow']
})
my_df

,City,Temp,State
0,MSK,-3,Cloudy
1,SPB,-5,Clear
2,EKB,-2,Snow


**DataFrame index**

By default, rows are labeled with consecutive integers starting from zero.
You can assign your own labels through `.index`:  
`my_df.index = [label1, label2, label3...]`  
You can also turn one of the columns into the index:  
`my_df.set_index(['column'])`  
The `inplace` parameter takes a Boolean and controls whether the original DataFrame is modified or a new one is returned (many pandas methods accept it).

An index can have several levels!

In [5]:
my_df.set_index(['City'], inplace=True)
my_df

City,Temp,State
MSK,-3,Cloudy
SPB,-5,Clear
EKB,-2,Snow
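The effect of `inplace` and the idea of a multi-level index can be sketched as follows. The frame `weather` and the other variable names are hypothetical and used only for illustration.

```python
import pandas as pd

weather = pd.DataFrame({
    'City': ['MSK', 'SPB', 'EKB'],
    'Temp': [-3, -5, -2],
    'State': ['Cloudy', 'Clear', 'Snow'],
})

# Without inplace=True, set_index returns a new DataFrame
# and leaves the original untouched
by_city = weather.set_index('City')
print(weather.columns.tolist())   # 'City' is still a regular column here
print(by_city.index.name)

# Passing several columns builds a multi-level index (MultiIndex)
by_city_state = weather.set_index(['City', 'State'])
print(by_city_state.index.nlevels)

# A row is then addressed by a tuple of labels
print(by_city_state.loc[('SPB', 'Clear'), 'Temp'])
```

Assigning the result to a new name, as done here, keeps the original frame available, which can make a notebook easier to re-run cell by cell.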


A DataFrame can be created from a file with functions such as:
- `pd.read_csv('path/to/file')`
- `pd.read_excel('path/to/file', sheet_name='sheet')`
- `pd.read_html('path/to/file')` (returns a list of DataFrames, one per HTML table)
- ...  
A DataFrame can be saved to a file with:
- `df.to_csv('path/to/file')`
- `df.to_excel('path/to/file')`
- ...


Some parameters of the file-reading functions  
The `na_values` parameter lists the values that mark missing data in the source file; they are replaced with NaN (not a number), a special value that denotes a missing entry.  
The `parse_dates` parameter lists the columns that contain dates and converts them to the datetime type.  
The `header` parameter specifies which row of the file holds the column names (`header=None` if there is no such row).  
The `names` parameter assigns the given names to the columns.
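A minimal sketch of these four parameters working together. The file contents are made up and read from an in-memory buffer, so the column names and the `?` marker are assumptions chosen for the example.

```python
import io
import pandas as pd

# Hypothetical file contents: no header row, '?' marks a missing value
raw = io.StringIO(
    "2021-01-01,MSK,-3\n"
    "2021-01-02,SPB,?\n"
    "2021-01-03,EKB,-2\n"
)

sample = pd.read_csv(
    raw,
    header=None,                      # the first line is data, not column names
    names=['Date', 'City', 'Temp'],   # assign column names ourselves
    na_values=['?'],                  # treat '?' as missing (NaN)
    parse_dates=['Date'],             # convert this column to datetime
)

print(sample.dtypes)
print(sample['Temp'].isna().sum())
```

Because one temperature is missing, the `Temp` column is read as floating point rather than integer.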

In [6]:
df = pd.read_csv('https://raw.githubusercontent.com/obulygin/SkillFactory/main/train.csv') 
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [7]:
df.to_excel('titanic.xlsx')

### Methods for a quick look at a DataFrame

- `.head(n)` returns the first n rows of the DataFrame (5 by default);
- `.tail(n)` returns the last n rows of the DataFrame (5 by default);
- `.describe()` shows summary statistics for the DataFrame and also works on individual columns;
- the `.shape` attribute gives the dimensions of the DataFrame as (rows, columns);
- `.value_counts()` counts how many times each unique value occurs in a column.

These methods are handy for a first look at the data.



In [8]:
df.head()
# df.head(5)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [9]:
# df.tail()
df.tail(10)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [10]:
df.shape

(891, 12)

In [11]:
# describe the dataset from a technical point of view: length, data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [12]:
# describe the dataset from a statistical point of view: count, mean, standard deviation, etc.
df.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [13]:
# each of these metrics (and others) can also be computed separately

print(df['Age'].mean())
print(df['Age'].median())
print(df['Age'].mode()[0])
print(df['Age'].std())
print(df['Age'].var())

29.69911764705882
28.0
24.0
14.526497332334044
211.0191247463081


### Selecting columns from a DataFrame

In [14]:
# this returns a Series
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [15]:
# this returns a DataFrame
df[['Name']]

,Name
0,"Braund, Mr. Owen Harris"
1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,"Heikkinen, Miss. Laina"
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,"Allen, Mr. William Henry"
...,...
886,"Montvila, Rev. Juozas"
887,"Graham, Miss. Margaret Edith"
888,"Johnston, Miss. Catherine Helen ""Carrie"""
889,"Behr, Mr. Karl Howell"


In [16]:
# and what does this return?
df['Name', 'Sex']

KeyError: ('Name', 'Sex')
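The `KeyError` above arises because `df['Name', 'Sex']` passes a single tuple key, `('Name', 'Sex')`, which pandas looks up as one column label. A minimal sketch on a made-up frame (`people` is a hypothetical name):

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Sex': ['female', 'male']})

try:
    people['Name', 'Sex']          # one key: the tuple ('Name', 'Sex')
except KeyError as err:
    print('KeyError:', err)

# A list of labels selects several columns and returns a DataFrame
subset = people[['Name', 'Sex']]
print(type(subset).__name__)
```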

In [None]:
df[['Name', 'Sex']]

,Name,Sex
0,"Braund, Mr. Owen Harris",male
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
2,"Heikkinen, Miss. Laina",female
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
4,"Allen, Mr. William Henry",male
...,...,...
886,"Montvila, Rev. Juozas",male
887,"Graham, Miss. Margaret Edith",female
888,"Johnston, Miss. Catherine Helen ""Carrie""",female
889,"Behr, Mr. Karl Howell",male


In [None]:
# we can look at the distribution of a variable's values
df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [None]:
df['Sex'].value_counts(normalize=True) * 100

male      64.758698
female    35.241302
Name: Sex, dtype: float64

In [None]:
df[['Sex', 'Pclass']].value_counts()

Sex     Pclass
male    3         347
female  3         144
male    1         122
        2         108
female  1          94
        2          76
dtype: int64

In [None]:
df['Age'].unique()

array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])

Slicing a DataFrame, for example `df[0:4]`, returns the selected rows (here the first four, at positions 0 through 3).

In [None]:
df[:5]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df[5:10:2]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


`.loc` selects by label (index value); unlike an ordinary slice, a label slice includes both endpoints.

In [None]:
df.loc[1:5]

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [None]:
# we can select the needed combination of index labels and columns in one step
df.loc[1:5, ['Name', 'Sex']]

,Name,Sex
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
2,"Heikkinen, Miss. Laina",female
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
4,"Allen, Mr. William Henry",male
5,"Moran, Mr. James",male


In [None]:
# make the passenger id the index
df.set_index('PassengerId', inplace=True)

In [None]:
# now loc works on the values of the new index
df.loc[1:5, ['Name', 'Sex']]

PassengerId,Name,Sex
1,"Braund, Mr. Owen Harris",male
2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
3,"Heikkinen, Miss. Laina",female
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
5,"Allen, Mr. William Henry",male


In [None]:
# .iloc selects by integer position of the row/column (starting from 0)
df.iloc[2:8:3, 5:7]

PassengerId,SibSp,Parch
3,0,0
6,0,0
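The difference between label-based and position-based selection is easiest to see when the labels do not coincide with the positions. A small sketch on a made-up frame (`passengers` and its values are hypothetical):

```python
import pandas as pd

passengers = pd.DataFrame(
    {'Name': ['Ann', 'Bob', 'Cid', 'Dee'], 'Age': [31, 25, 47, 19]},
    index=[10, 20, 30, 40],   # labels deliberately differ from positions
)

# .loc slices by label and includes both endpoints
by_label = passengers.loc[10:30]

# .iloc slices by position and excludes the end, like a Python list
by_position = passengers.iloc[0:2]

print(by_label.index.tolist())
print(by_position.index.tolist())
```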


### Filtering by condition

In [None]:
df['Age'] > 30

PassengerId
1      False
2       True
3      False
4       True
5       True
       ...  
887    False
888    False
889    False
890    False
891     True
Name: Age, Length: 891, dtype: bool

In [None]:
df[df['Age'] > 30]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...
874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,,S
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q


In [None]:
# use & where you would write `and`
# use | where you would write `or`; wrap each condition in parentheses

df[(df.Age > 20) & (df.Age < 40)]

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
df[(df.Age >= 20) & (df.Age <= 40) & (df.Pclass == 1)] 

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5000,A6,S
31,0,1,"Uruchurtu, Don. Manuel E",male,40.0,0,0,PC 17601,27.7208,,C
35,0,1,"Meyer, Mr. Edgar Joseph",male,28.0,1,0,PC 17604,82.1708,,C
...,...,...,...,...,...,...,...,...,...,...,...
836,1,1,"Compton, Miss. Sara Rebecca",female,39.0,1,1,PC 17756,83.1583,E49,C
843,1,1,"Serepeca, Miss. Augusta",female,30.0,0,0,113798,31.0000,,C
868,0,1,"Roebling, Mr. Washington Augustus II",male,31.0,0,0,PC 17590,50.4958,A24,S
873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
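A few more ways to build Boolean masks, sketched on a small made-up frame (`ages` is a hypothetical name). The `~` operator and `Series.between` are not used in the notebook itself; they are shown here as common companions to `&` and `|`.

```python
import pandas as pd

ages = pd.DataFrame({'Age': [18, 25, 40, 63], 'Pclass': [3, 1, 1, 2]})

# Each comparison needs its own parentheses because & and |
# bind more tightly than comparison operators such as < or ==
young_or_old = ages[(ages['Age'] < 20) | (ages['Age'] > 60)]

# ~ negates a mask (the counterpart of `not`)
not_first_class = ages[~(ages['Pclass'] == 1)]

# between() is a shorthand for a closed range: 20 <= Age <= 40
mid_age = ages[ages['Age'].between(20, 40)]

print(young_or_old['Age'].tolist())
print(not_first_class['Pclass'].tolist())
print(mid_age['Age'].tolist())
```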


In [None]:
# contains checks whether a string contains a substring
james = df[df['Name'].str.contains('James')]
james

# there are many more string methods like this

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
68,0,3,"Crease, Mr. Ernest James",male,19.0,0,0,S.P. 3464,8.1583,,S
135,0,2,"Sobey, Mr. Samuel James Hayden",male,25.0,0,0,C.A. 29178,13.0,,S
151,0,2,"Bateman, Rev. Robert James",male,51.0,0,0,S.O.P. 1166,12.525,,S
162,1,2,"Watt, Mrs. James (Elizabeth ""Bessie"" Inglis Mi...",female,40.0,0,0,C.A. 33595,15.75,,S
175,0,1,"Smith, Mr. James Clinch",male,56.0,0,0,17764,30.6958,A7,C
195,1,1,"Brown, Mrs. James Joseph (Margaret Tobin)",female,44.0,0,0,PC 17610,27.7208,B4,C
222,0,2,"Bracken, Mr. James H",male,27.0,0,0,220367,13.0,,S
251,0,3,"Reed, Mr. James George",male,,0,0,362316,7.25,,S
300,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0,1,PC 17558,247.5208,B58 B60,C
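As an illustration of other string methods available through the `.str` accessor, here is a short sketch on a made-up Series (`sample_names` is a hypothetical name). The `na=False` argument is an addition to the notebook's call and matters only when the column contains missing values.

```python
import pandas as pd

sample_names = pd.Series(['Moran, Mr. James', 'Heikkinen, Miss. Laina', None])

# na=False makes missing names count as "no match" instead of propagating NaN
has_james = sample_names.str.contains('James', na=False)

# Other vectorized string methods follow the same pattern
starts_with_moran = sample_names.str.startswith('Moran', na=False)
surnames = sample_names.str.split(',').str[0]
upper = sample_names.str.upper()

print(has_james.tolist())
print(surnames.tolist())
```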


**Practice**  
1) Select only the passengers who boarded at port S (the `Embarked` column);  
2) Compute their average age;  
3) Compute the distribution by sex and by class within this selection.  

In [18]:
df[df.Embarked == 'S']['Age'].mean()

29.44539711191336

In [20]:
df[df.Embarked == 'S'][['Pclass', 'Sex']].value_counts()

Pclass  Sex   
3       male      265
2       male       97
3       female     88
1       male       79
2       female     67
1       female     48
dtype: int64

In [None]:
# df[df['Embarked'] == 'S']
# df[df['Embarked'] == 'S'].Age.mean()
# df[df['Embarked'] == 'S'].Sex.value_counts()
df[df['Embarked'] == 'S'].Pclass.value_counts()

3    353
2    164
1    127
Name: Pclass, dtype: int64

### Sorting

In [None]:
df.sort_values('Fare', ascending=False)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.3292,,C
738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
680,1,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0000,C23 C25 C27,S
28,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,19950,263.0000,C23 C25 C27,S
...,...,...,...,...,...,...,...,...,...,...,...
634,0,1,"Parr, Mr. William Henry Marsh",male,,0,0,112052,0.0000,,S
414,0,2,"Cunningham, Mr. Alfred Fleming",male,,0,0,239853,0.0000,,S
823,0,1,"Reuchlin, Jonkheer. John George",male,38.0,0,0,19972,0.0000,,S
733,0,2,"Knight, Mr. Robert J",male,,0,0,239855,0.0000,,S


In [None]:
df.sort_values('Age', ascending=False)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0000,A23,S
852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S
494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.7500,,Q
...,...,...,...,...,...,...,...,...,...,...,...
860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


In [None]:
# you can sort by several columns at once
df.sort_values(['Age', 'Sex'], ascending=False)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0000,A23,S
852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S
97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.7500,,Q
...,...,...,...,...,...,...,...,...,...,...,...
728,1,3,"Mannion, Miss. Margareth",female,,0,0,36866,7.7375,,Q
793,0,3,"Sage, Miss. Stella Anna",female,,8,2,CA. 2343,69.5500,,S
850,1,1,"Goldenberg, Mrs. Samuel L (Edwiga Grabowska)",female,,1,0,17453,89.1042,C92,C
864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S


In [None]:
# you can also sort by the index
df.sort_index(ascending=False)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
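Two further options of `sort_values`, sketched on a made-up frame (`people` is a hypothetical name): a separate direction for each column, and control over where rows with a missing sort key end up.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'Sex': ['female', 'male', 'male', 'female'],
    'Age': [30.0, None, 30.0, 22.0],
})

# One direction per column: Age descending, then Name ascending to break ties
ordered = people.sort_values(['Age', 'Name'], ascending=[False, True])

# Rows with a missing sort key go last by default;
# na_position='first' moves them to the top
missing_first = people.sort_values('Age', na_position='first')

print(ordered['Name'].tolist())
print(missing_first['Name'].tolist())
```

This default placement of missing values is why the passengers with an unknown age appear at the bottom of the sorted tables above, regardless of the sort direction.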


### Missing values

In [None]:
# this flags the missing values (True where Age is missing)
df['Age'].isna()

PassengerId
1      False
2      False
3      False
4      False
5      False
       ...  
887    False
888    False
889     True
890    False
891    False
Name: Age, Length: 891, dtype: bool

In [None]:
df['Age'].isna().sum()

177

In [None]:
# compute the share of missing values in every column
for col in df.columns:
    pct_missing = df[col].isna().mean()
    print(f'{col} - {pct_missing :.1%}')

Survived - 0.0%
Pclass - 0.0%
Name - 0.0%
Sex - 0.0%
Age - 19.9%
SibSp - 0.0%
Parch - 0.0%
Ticket - 0.0%
Fare - 0.0%
Cabin - 77.1%
Embarked - 0.2%
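The same per-column shares can be obtained without an explicit loop. A minimal sketch on a made-up frame (`sample` and its values are hypothetical):

```python
import pandas as pd

sample = pd.DataFrame({
    'Age': [22.0, None, 26.0, None],
    'Cabin': [None, 'C85', None, None],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# isna() gives a Boolean frame; the column-wise mean of True/False values
# is the share of missing entries in each column
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```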


In [None]:
# drop rows that contain missing values
df.dropna()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...
872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [None]:
# thresh=11 keeps only the rows with at least 11 non-missing values; df now has 11 columns, so these are the fully populated rows
df.dropna(thresh=11).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 2 to 890
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  183 non-null    int64  
 1   Pclass    183 non-null    int64  
 2   Name      183 non-null    object 
 3   Sex       183 non-null    object 
 4   Age       183 non-null    float64
 5   SibSp     183 non-null    int64  
 6   Parch     183 non-null    int64  
 7   Ticket    183 non-null    object 
 8   Fare      183 non-null    float64
 9   Cabin     183 non-null    object 
 10  Embarked  183 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 17.2+ KB


In [None]:
# drop columns that contain missing values

# Cabin has a lot of missing values. Do we really need this column?
df.dropna(axis=1).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Ticket    891 non-null    object 
 7   Fare      891 non-null    float64
dtypes: float64(1), int64(4), object(3)
memory usage: 94.9+ KB
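How `thresh` behaves is easier to follow on a tiny example. The sketch below uses a made-up frame (`sample`) and also shows `subset` and `fillna`, which the notebook does not cover; they are included as common alternatives to dropping whole rows.

```python
import pandas as pd

sample = pd.DataFrame({
    'Age': [22.0, None, 26.0],
    'Cabin': [None, None, 'C85'],
    'Fare': [7.25, 8.46, 7.92],
})

# thresh=N keeps rows with at least N non-missing values
at_least_two = sample.dropna(thresh=2)

# subset limits the check to the listed columns
with_age = sample.dropna(subset=['Age'])

# Instead of dropping rows, fill the gaps, here with the column median
filled = sample.assign(Age=sample['Age'].fillna(sample['Age'].median()))

print(at_least_two.index.tolist())
print(with_age.index.tolist())
print(filled['Age'].tolist())
```

Whether to drop or fill depends on the analysis; filling with the median is only one possible choice.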


### Grouping data

The `.groupby()` method groups a DataFrame by a column so that aggregates can be computed per group.
It is usually combined with methods such as `.sum()`, `.count()`, `.min()`, `.max()`, `.std()`, `.mean()`, `.median()`, `.var()` and so on.

How **groupby** works
![](https://i.stack.imgur.com/sgCn1.jpg)

- Split the data into groups by some criterion

- Apply a function to each group

- Combine the results into a single structure
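The three steps listed above can be spelled out by hand and compared with a single `groupby` call. This is a sketch on a made-up frame (`sample`); the manual version is for illustration only and is not how pandas implements grouping internally.

```python
import pandas as pd

sample = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female'],
    'Age': [22.0, 38.0, 35.0, 26.0],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs
# Apply: compute the mean age within each group
# Combine: collect the per-group results into one Series
manual = pd.Series(
    {sex: group['Age'].mean() for sex, group in sample.groupby('Sex')}
)

# groupby performs all three steps in one call
automatic = sample.groupby('Sex')['Age'].mean()

print(manual.to_dict())
print(automatic.to_dict())
```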

The `.agg()` method computes several aggregates at once. It is used together with `.groupby()`.

In [None]:
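# numeric_only=True skips text columns such as Name (needed in pandas 2.0 and later)
df.groupby('Sex').mean(numeric_only=True)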

Sex,Survived,Pclass,Age,SibSp,Parch,Fare
female,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818
male,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893


In [None]:
df.groupby('Sex')['Age'].median()

Sex
female    27.0
male      29.0
Name: Age, dtype: float64

In [None]:
# we can group by several columns at once
df.groupby(['Sex', 'Pclass'])['Age'].median()

Sex     Pclass
female  1         35.0
        2         28.0
        3         21.5
male    1         40.0
        2         30.0
        3         25.0
Name: Age, dtype: float64

In [None]:
# to compute several aggregates at once, use agg (numeric columns are selected explicitly, since median is undefined for text)
df.groupby(['Sex', 'Pclass'])[['Survived', 'Age', 'SibSp', 'Parch', 'Fare']].agg(['median', 'max'])

,,Survived,Survived,Age,Age,SibSp,SibSp,Parch,Parch,Fare,Fare
Sex,Pclass,median,max,median,max,median,max,median,max,median,max
female,1,1.0,1,35.0,63.0,0.0,3,0.0,2,82.66455,512.3292
female,2,1.0,1,28.0,57.0,0.0,3,0.0,3,22.0,65.0
female,3,0.5,1,21.5,63.0,0.0,8,0.0,6,12.475,69.55
male,1,0.0,1,40.0,80.0,0.0,3,0.0,4,41.2625,512.3292
male,2,0.0,1,30.0,70.0,0.0,2,0.0,2,13.0,73.5
male,3,0.0,1,25.0,74.0,0.0,8,0.0,5,7.925,69.55


In [None]:
df.groupby(['Sex', 'Pclass'])['Age'].agg(['median', 'max'])

Sex,Pclass,median,max
female,1,35.0,63.0
female,2,28.0,57.0
female,3,21.5,63.0
male,1,40.0,80.0
male,2,30.0,70.0
male,3,25.0,74.0


In [None]:
# survival rate broken down by sex and class
df.groupby(['Sex', 'Pclass'])['Survived'].agg('mean')

Sex     Pclass
female  1         0.968085
        2         0.921053
        3         0.500000
male    1         0.368852
        2         0.157407
        3         0.135447
Name: Survived, dtype: float64

In [None]:
# minimum, mean and maximum ticket price per class (excluding free tickets)
df[df['Fare'] > 0].groupby(by='Pclass')['Fare'].agg(['min', 'mean', 'max'])

Pclass,min,mean,max
1,5.0,86.148874,512.3292
2,10.5,21.358661,73.5
3,4.0125,13.787875,69.55
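When the result columns need readable names, `agg` also accepts keyword arguments of the form `new_name=(column, function)`, known in pandas as named aggregation. A sketch on a made-up frame (`fares` and the output column names are hypothetical):

```python
import pandas as pd

fares = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Fare': [80.0, 0.0, 13.0, 7.5, 8.5],
})

# Exclude free tickets, then name each aggregate explicitly
summary = (
    fares[fares['Fare'] > 0]
    .groupby('Pclass')
    .agg(
        min_fare=('Fare', 'min'),
        mean_fare=('Fare', 'mean'),
        tickets=('Fare', 'count'),
    )
)

print(summary)
```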


In [None]:
# pivot tables are another way to group values

df.pivot_table(index='Pclass', columns='Sex', values='Age', aggfunc='mean')

Sex,female,male
Pclass,,
1,34.611765,41.281386
2,28.722973,30.740707
3,21.75,26.507589


In [None]:
# the margins parameter adds an "All" row and column
df.pivot_table(values='Survived', index='Sex', columns='Pclass', aggfunc='mean', margins=True)

Pclass,1,2,3,All
Sex,,,,
female,0.968085,0.921053,0.5,0.742038
male,0.368852,0.157407,0.135447,0.188908
All,0.62963,0.472826,0.242363,0.383838
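A pivot table without margins can be thought of as a `groupby` over two columns followed by `unstack`, which moves one grouping level into the columns. A sketch on a made-up frame (`sample` and its values are hypothetical):

```python
import pandas as pd

sample = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female'],
    'Pclass': [1, 1, 3, 3],
    'Survived': [0, 1, 0, 1],
})

pivot = sample.pivot_table(
    index='Sex', columns='Pclass', values='Survived', aggfunc='mean'
)

# Group by both columns, then move the Pclass level into the columns
grouped = sample.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack('Pclass')

print(pivot)
print(grouped)
```

The `pivot_table` form is usually more convenient when margins or several value columns are needed.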


In [None]:
names = pd.read_csv('names/yob2019.txt', names=['Name', 'Sex', 'Count'])
names.sort_values('Count', ascending=False).head(5)

,Name,Sex,Count
17905,Liam,M,20502
17906,Noah,M,19048
0,Olivia,F,18451
1,Emma,F,17102
2,Ava,F,14440


In [None]:
names.groupby('Sex')['Name'].nunique()

Sex
F    17905
M    14049
Name: Name, dtype: int64

### Thank you for your attention! I'm happy to answer your questions
Feedback form: https://forms.gle/y8xaFwJqtbFSjUeG8