In [1]:
#!/usr/bin/env python
# the shebang line above makes the exported .py script runnable

In [2]:
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')

In [3]:
import os

# 2.6 Basic Exploration and Sanity checks - Tasks

What follows are a number of tasks in which you will receive or create a dataset to play around with. The main point is to let you get 'a feel' for the dataset, using functions you create yourself. This is an important and often overlooked part of data science. It allows for sanity checks, which should be done and redone throughout any data science project. A lot of mistakes can be avoided this way, simply by checking whether every outcome 'makes sense' (e.g. noticing that the maximum age in a dataset is 684 years, or that a predicted probability is negative). Not every command you need has been explained explicitly before, so explore Python's built-in help.
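
As a minimal sketch of what such a sanity check might look like, the hypothetical helper below flags ages outside a plausible range. The name `sanity_check_ages` and the bounds 0 and 120 are illustrative assumptions, not part of the course material.

```python
import pandas as pd

def sanity_check_ages(ages, lower=0, upper=120):
    """Return the entries of `ages` that fall outside [lower, upper] (hypothetical helper)."""
    # Missing values compare as False on both sides, so they are not flagged here
    return ages[(ages < lower) | (ages > upper)]

toy_ages = pd.Series([23, 41, 684, 35, -2])   # made-up data with two implausible entries
suspicious = sanity_check_ages(toy_ages)
print(suspicious)
```

Anything the helper returns deserves a closer look before any further analysis.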

Solutions can be found at the bottom.

## 2.6.1 Tasks

### Task 7

Generate a Series of 150 ages from a normal distribution with a mean of 35 years, pick your own reasonable SD (sanity check!). Set every fifteenth value to missing. Find the new mean. Fill the missing data with (a) mean (b) median, and report the new means.

### Task 8

- Define a function called 'Standardizing' that works on a Series as follows:
    - Find the mean and standard deviation of the Series
    - Subtract the mean from each value of the Series
    - Divide the result by the standard deviation
    - Return a Series with the results
- Test it on a random Series
    - Check the mean and standard deviation of the standardized Series
- Apply this function to each numerical column (int or float) of the Titanic dataset
    - Once by manually checking which columns are float or int
    - Once by letting Python decide
    - Check the mean and standard deviation of the standardized Series

Standardizing is a very important technique in statistics; more on this later!

### Task 9

Create a function called DESCRIBE_X that takes a Series as input
and produces a Series containing

- `min, max, sum, mean, std, missings, nonmissings, skew, kurtosis`
- `percentiles - as specified by the user`

so that a call looks like:

`DESCRIBE_X(s_1, percentiles=[0.1, 0.3, 0.92, 0.99])`

Then apply this function to (1) the rows and (2) the columns of a randomly constructed 10 by 5 DataFrame.


### Task 10

- Report the shape of the Titanic DataFrame and the data type of each column.

- Subset the data to retain only the passengers that survived. Name this object 'survivors'
- Within the 'survivors' find the summary statistics of the 'Age' and 'Fare' variables.

- Subset the survivors data to retain only Female passengers.
- Find out how many females over the age of 30 survived.

- Are there any missing values in the Age column of the survivors DataFrame? Did this impact the previous question?
- Make a copy of the survivors DataFrame with the missing Age values replaced with the mean.
- Report the new summary statistics on Age. What happened? Why?
- Find out how many females over the age of 30 survived now. What changed? Does this make sense? What does this say about the previous result? What if the mean age were 31?
 

## 2.6.2 Solutions

### Task 7

Generate a Series of 150 ages from a normal distribution with a mean of 35 years, pick your own reasonable SD (sanity check!). Set every fifteenth value to missing. Find the new mean. Fill the missing data with (a) mean (b) median, and report the new means.

In [4]:
np.random.normal(loc=35, scale=10, size=150)

array([24.62355142, 43.39084382, 51.53442754, 39.27990492, 46.25031052,
       36.55860536, 51.05900622, 43.26816453, 27.02370782, 27.22050074,
       22.7330337 , 30.82498427, 30.42975071, 41.01016633, 50.50534289,
       29.05813542, 27.18291729, 38.76521132, 25.49994803, 32.29436138,
       33.04259458, 26.16645159, 27.90664065,  8.84721504, 18.72995718,
       37.47837415, 46.64799458, 26.02450145, 35.44683774, 31.77087852,
       25.42632919, 34.52104785, 30.33780978, 35.21186946, 46.49049687,
       31.706755  , 37.89219176, 32.14202097, 34.7208261 , 33.27436649,
       32.58399086, 23.79113693, 35.52706186, 35.90391147, 36.08412022,
       30.82549072,  4.84588616, 40.80410361, 29.64656493, 17.30094262,
       37.72714226, 32.041067  , 12.95265927, 27.03290361, 31.33223595,
       29.90606948, 44.70124638, 26.4228755 , 23.79256407,  9.21370252,
       30.98292602, 53.66855738, 39.46814554, 28.35217001, 34.30214558,
       25.68952774, 44.37634328, 32.77017685, 14.92531266, 49.91

In [6]:
ages = pd.Series(np.random.normal(loc=35, scale=10, size=150))
ages

0      23.295854
1      35.418675
2      30.783763
3      35.592637
4      31.441039
5      27.215629
6      40.098145
7      44.626544
8      41.342058
9      35.631908
10     33.334627
11     42.264236
12     32.272869
13     43.294791
14     20.957006
15     38.504357
16     11.717584
17     43.883360
18     32.138184
19     42.436433
20     43.177389
21     49.568468
22     34.063409
23     31.107427
24     29.655758
25     40.267723
26     27.855405
27     27.096122
28     50.964104
29     30.610591
         ...    
120    25.805090
121    28.015848
122    43.023501
123    25.715934
124    48.490441
125    36.195236
126    27.624210
127    51.118389
128    32.964534
129    52.092552
130    34.957576
131    50.425775
132    43.139111
133    22.498836
134    56.145898
135    40.779030
136    23.330684
137    56.212364
138    28.068172
139    41.936312
140    34.657038
141    25.313992
142    33.434110
143    19.664907
144    26.040363
145    18.896453
146    22.527969
147    36.7239

In [7]:
ages.mean()

35.14515073345027

In [10]:
# Set every fifteenth value (positions 14, 29, 44, ...) to missing
for i in range(len(ages)):
    if (i + 1) % 15 == 0:
        ages.iloc[i] = np.nan

In [11]:
ages.mean()

35.2580046022811

In [12]:
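# Vectorized alternative. Note that [::15] starts at index 0 (0, 15, 30, ...),
# whereas the loop above blanked indices 14, 29, ...; ages[14::15] would hit
# the same positions as the loop. Running both leaves 20 values missing.
ages[::15] = np.nan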

In [13]:
print(ages)

0            NaN
1      35.418675
2      30.783763
3      35.592637
4      31.441039
5      27.215629
6      40.098145
7      44.626544
8      41.342058
9      35.631908
10     33.334627
11     42.264236
12     32.272869
13     43.294791
14           NaN
15           NaN
16     11.717584
17     43.883360
18     32.138184
19     42.436433
20     43.177389
21     49.568468
22     34.063409
23     31.107427
24     29.655758
25     40.267723
26     27.855405
27     27.096122
28     50.964104
29           NaN
         ...    
120          NaN
121    28.015848
122    43.023501
123    25.715934
124    48.490441
125    36.195236
126    27.624210
127    51.118389
128    32.964534
129    52.092552
130    34.957576
131    50.425775
132    43.139111
133    22.498836
134          NaN
135          NaN
136    23.330684
137    56.212364
138    28.068172
139    41.936312
140    34.657038
141    25.313992
142    33.434110
143    19.664907
144    26.040363
145    18.896453
146    22.527969
147    36.7239

In [14]:
ages.mean()

35.30663390015777

In [15]:
ages.median()

35.43341609541869

In [16]:
ages.fillna(ages.mean()).mean()

35.30663390015777

In [17]:
ages.fillna(ages.median()).mean()

35.32353819285923
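
The steps above can also be condensed into one reproducible cell. This is a sketch under assumptions: the seed 42 and the name `ages_demo` are arbitrary choices, and the slice `[14::15]` blanks positions 14, 29, ... (every fifteenth value when counting from one). Filling with the mean leaves the mean unchanged, because the filled values equal the mean of the observed values; filling with the median generally shifts it slightly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)                  # arbitrary seed, for reproducibility only
ages_demo = pd.Series(rng.normal(loc=35, scale=10, size=150))

ages_demo.iloc[14::15] = np.nan                  # positions 14, 29, ..., 149
n_missing = ages_demo.isna().sum()

mean_observed = ages_demo.mean()                 # NaN values are skipped by default
mean_after_mean_fill = ages_demo.fillna(ages_demo.mean()).mean()
mean_after_median_fill = ages_demo.fillna(ages_demo.median()).mean()

print(n_missing, mean_observed, mean_after_mean_fill, mean_after_median_fill)
```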

### Task 8

- Define a function called 'Standardizing' that works on a Series as follows:
    - Find the mean and standard deviation of the Series
    - Subtract the mean from each value of the Series
    - Divide the result by the standard deviation
    - Return a Series with the results
- Test it on a random Series
    - Check the mean and standard deviation of the standardized Series
- Apply this function to each numerical column (int or float) of the Titanic dataset
    - Once by manually checking which columns are float or int
    - Once by letting Python decide
    - Check the mean and standard deviation of the standardized Series

Standardizing is a very important technique in statistics; more on this later!

In [18]:
def standardize_s(ser):
    """Return ser with its mean subtracted, divided by its standard deviation."""
    STser = (ser-ser.mean())/ser.std()
    return STser

In [28]:
ser1=pd.Series([1,2,3.1,4,5,6,7.5])

In [29]:
ser1.mean()

4.085714285714286

In [30]:
ser1.median()

4.0

In [31]:
ser1.std()

2.2733445049299585

In [32]:
standardize_s(ser1)

0   -1.357346
1   -0.917465
2   -0.433597
3   -0.037704
4    0.402176
5    0.842057
6    1.501878
dtype: float64

In [33]:
standardize_s(ser1).mean()

-3.1720657846433045e-16

In [34]:
standardize_s(ser1).std()

1.0
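
One detail worth a sanity check: `Series.std()` in pandas uses the sample standard deviation (`ddof=1`), while `numpy.std` defaults to the population version (`ddof=0`). The sketch below reuses the values of `ser1`; the names `ser_demo` and `z` are only illustrative. It suggests that the standardized series has a standard deviation of exactly 1 only under the same convention that was used to standardize.

```python
import numpy as np
import pandas as pd

ser_demo = pd.Series([1, 2, 3.1, 4, 5, 6, 7.5])     # same values as ser1 above
z = (ser_demo - ser_demo.mean()) / ser_demo.std()   # pandas default: ddof=1

std_sample = z.std()                     # ddof=1, the convention used to standardize
std_population = np.std(z.to_numpy())    # ddof=0, slightly smaller than 1
print(std_sample, std_population)
```

With n = 7 values, the population version comes out as the square root of 6/7 rather than 1.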

In [35]:
df_titanic = pd.read_csv('data/titanic.csv')
df_titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [36]:
standardize_s(df_titanic.PassengerId) # similar for Survived, Pclass, Age, SibSp, Parch and Fare

0     -1.729137
1     -1.725251
2     -1.721365
3     -1.717480
4     -1.713594
5     -1.709708
6     -1.705823
7     -1.701937
8     -1.698051
9     -1.694165
10    -1.690280
11    -1.686394
12    -1.682508
13    -1.678623
14    -1.674737
15    -1.670851
16    -1.666966
17    -1.663080
18    -1.659194
19    -1.655308
20    -1.651423
21    -1.647537
22    -1.643651
23    -1.639766
24    -1.635880
25    -1.631994
26    -1.628109
27    -1.624223
28    -1.620337
29    -1.616451
         ...   
861    1.616451
862    1.620337
863    1.624223
864    1.628109
865    1.631994
866    1.635880
867    1.639766
868    1.643651
869    1.647537
870    1.651423
871    1.655308
872    1.659194
873    1.663080
874    1.666966
875    1.670851
876    1.674737
877    1.678623
878    1.682508
879    1.686394
880    1.690280
881    1.694165
882    1.698051
883    1.701937
884    1.705823
885    1.709708
886    1.713594
887    1.717480
888    1.721365
889    1.725251
890    1.729137
Name: PassengerId, Lengt

In [37]:
df_titanic.select_dtypes(include=['float64','int64'])

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,0,3,22.0,1,0,7.2500
1,2,1,1,38.0,1,0,71.2833
2,3,1,3,26.0,0,0,7.9250
3,4,1,1,35.0,1,0,53.1000
4,5,0,3,35.0,0,0,8.0500
5,6,0,3,,0,0,8.4583
6,7,0,1,54.0,0,0,51.8625
7,8,0,3,2.0,3,1,21.0750
8,9,1,3,27.0,0,2,11.1333
9,10,1,2,14.0,1,0,30.0708


In [38]:
df_titanic.select_dtypes(include=['float64','int64']).apply(lambda x : standardize_s(x))

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
0,-1.729137,-0.788829,0.826913,-0.530005,0.432550,-0.473408,-0.502163
1,-1.725251,1.266279,-1.565228,0.571430,0.432550,-0.473408,0.786404
2,-1.721365,1.266279,0.826913,-0.254646,-0.474279,-0.473408,-0.488580
3,-1.717480,1.266279,-1.565228,0.364911,0.432550,-0.473408,0.420494
4,-1.713594,-0.788829,0.826913,0.364911,-0.474279,-0.473408,-0.486064
5,-1.709708,-0.788829,0.826913,,-0.474279,-0.473408,-0.477848
6,-1.705823,-0.788829,-1.565228,1.672866,-0.474279,-0.473408,0.395591
7,-1.701937,-0.788829,0.826913,-1.906799,2.246209,0.767199,-0.223957
8,-1.698051,1.266279,0.826913,-0.185807,-0.474279,2.007806,-0.424018
9,-1.694165,1.266279,-0.369158,-1.080723,0.432550,-0.473408,-0.042931


In [40]:
# The means are zero up to floating-point error; append .round(10) to hide the rounding noise
df_titanic.select_dtypes(include=['float64','int64']).apply(lambda x : standardize_s(x)).mean()

PassengerId   -2.317637e-17
Survived      -6.180366e-17
Pclass        -2.360003e-16
Age            2.232497e-16
SibSp         -2.217954e-16
Parch          2.753752e-17
Fare           9.968332e-18
dtype: float64

In [41]:
df_titanic.select_dtypes(include=['float64','int64']).apply(lambda x : standardize_s(x)).std()

PassengerId    1.0
Survived       1.0
Pclass         1.0
Age            1.0
SibSp          1.0
Parch          1.0
Fare           1.0
dtype: float64
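
Since `data/titanic.csv` may not be at hand, the following sketch illustrates the "let Python decide" step on a small made-up frame (`toy`, its columns and its values are invented for illustration, as is the helper name `standardize_demo`). `select_dtypes(include='number')` picks up all numeric columns without listing `float64` and `int64` explicitly, and missing values simply stay missing after standardizing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd'],          # object column, should be skipped
    'Age': [22.0, 38.0, np.nan, 35.0],     # float column with a missing value
    'Pclass': [3, 1, 3, 1],                # integer column
})

def standardize_demo(ser):
    """Subtract the mean and divide by the (sample) standard deviation."""
    return (ser - ser.mean()) / ser.std()

numeric = toy.select_dtypes(include='number')   # keeps Age and Pclass only
standardized = numeric.apply(standardize_demo)
print(standardized)
```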

### Task 9

Create a function called DESCRIBE_X that takes a Series as input
and produces a Series containing

- `min, max, sum, mean, std, missings, nonmissings, skew, kurtosis`
- `percentiles - as specified by the user`

so that a call looks like:

`DESCRIBE_X(s_1, percentiles=[0.1, 0.3, 0.92, 0.99])`

Then apply this function to (1) the rows and (2) the columns of a randomly constructed 10 by 5 DataFrame.


In [46]:
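def DESCRIBE_X(s, perc=None):
    """
    Return a Series of summary statistics for the Series s,
    followed by the quantiles listed in perc (fractions between 0 and 1).
    """
    # Avoid mutable default arguments: build the empty list at call time
    perc = [] if perc is None else perc
    # Note: the task also asks for the standard deviation;
    # add 'STD': s.std() to the dictionary below to include it
    s1 = pd.Series({'MIN': s.min(),
                    'MAX': s.max(),
                    'SUM': s.sum(),
                    'MEAN': s.mean(),
                    'MEDIAN': s.median(),
                    'MISSINGS': s.isnull().sum(),
                    'NONMISSINGS': s.notnull().sum(),
                    'SKEW': s.skew(),
                    'KURT': s.kurtosis()})
    s2 = s.quantile(perc)
    return pd.concat([s1, s2])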

In [47]:
DESCRIBE_X(s=pd.Series(np.random.random(100)), perc=[0.5, 0.25, 0.75])

MIN              0.013501
MAX              0.985483
SUM             50.848565
MEAN             0.508486
MEDIAN           0.513066
MISSINGS         0.000000
NONMISSINGS    100.000000
SKEW            -0.008368
KURT            -1.260994
0.5              0.513066
0.25             0.264112
0.75             0.746390
dtype: float64

In [42]:
ser1.describe()

count    7.000000
mean     4.085714
std      2.273345
min      1.000000
25%      2.550000
50%      4.000000
75%      5.500000
max      7.500000
dtype: float64

In [48]:
df_1 = pd.DataFrame(np.random.randn(50).reshape(10, 5), columns=['Col_' + str(i) for i in range(5)])
df_1

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
0,-0.059339,1.11941,2.496873,-1.527336,-1.510875
1,-1.270804,0.265423,0.361511,-0.158218,1.864205
2,0.943151,-2.975639,1.154291,-0.463925,0.013872
3,-0.530674,0.970509,-1.56922,0.489549,-0.001683
4,0.259932,1.251442,-0.674431,0.213347,1.245668
5,-0.205076,0.089472,-1.170028,0.812149,-0.738358
6,-0.462805,0.619822,-0.869817,0.955217,1.236143
7,-1.731952,0.669719,-0.257979,-0.053682,0.003356
8,2.027398,-1.228313,-0.259053,-1.605586,-0.459099
9,0.803376,1.145691,-0.760091,0.264123,-0.092886


In [49]:
df_1.apply(lambda c: DESCRIBE_X(s=c, perc=[0.1, 0.2, 0.8, 0.9]))

Unnamed: 0,Col_0,Col_1,Col_2,Col_3,Col_4
MIN,-1.731952,-2.975639,-1.56922,-1.605586,-1.510875
MAX,2.027398,1.251442,2.496873,0.955217,1.864205
SUM,-0.226793,1.927535,-1.547945,-1.074364,1.560343
MEAN,-0.022679,0.192753,-0.154794,-0.107436,0.156034
MEDIAN,-0.132207,0.64477,-0.466742,0.079832,0.000836
MISSINGS,0.0,0.0,0.0,0.0,0.0
NONMISSINGS,10.0,10.0,10.0,10.0,10.0
SKEW,0.30818,-1.830549,1.28216,-0.815121,0.253086
KURT,0.165551,3.149365,1.532771,-0.234328,-0.344368
0.1,-1.316919,-1.403046,-1.209947,-1.535161,-0.81561


In [50]:
df_1.apply(lambda c: DESCRIBE_X(s=c, perc=[0.1, 0.2, 0.8, 0.9]), axis=1)

Unnamed: 0,MIN,MAX,SUM,MEAN,MEDIAN,MISSINGS,NONMISSINGS,SKEW,KURT,0.1,0.2,0.8,0.9
0,-1.527336,2.496873,0.518732,0.103746,-0.059339,0.0,5.0,0.493928,-1.427481,-1.520752,-1.514168,1.394903,1.945888
1,-1.270804,1.864205,1.062116,0.212423,0.265423,0.0,5.0,0.347141,1.494811,-0.82577,-0.380735,0.66205,1.263127
2,-2.975639,1.154291,-1.328251,-0.26565,0.013872,0.0,5.0,-1.405864,2.052301,-1.970954,-0.966268,0.985379,1.069835
3,-1.56922,0.970509,-0.641519,-0.128304,-0.001683,0.0,5.0,-0.660179,0.043811,-1.153802,-0.738383,0.585741,0.778125
4,-0.674431,1.251442,2.295957,0.459191,0.259932,0.0,5.0,-0.387203,-0.972035,-0.31932,0.035791,1.246822,1.249132
5,-1.170028,0.812149,-1.211841,-0.242368,-0.205076,0.0,5.0,0.270757,-0.449203,-0.99736,-0.824692,0.234007,0.523078
6,-0.869817,1.236143,1.47856,0.295712,0.619822,0.0,5.0,-0.469668,-2.398034,-0.707012,-0.544207,1.011402,1.123773
7,-1.731952,0.669719,-1.370539,-0.274108,-0.053682,0.0,5.0,-1.332025,2.775617,-1.142363,-0.552773,0.136628,0.403174
8,-1.605586,2.027398,-1.524654,-0.304931,-0.459099,0.0,5.0,1.425705,2.343074,-1.454677,-1.303768,0.198237,1.112818
9,-0.760091,1.145691,1.360213,0.272043,0.264123,0.0,5.0,-0.328504,-0.792476,-0.493209,-0.226327,0.871839,1.008765
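
The task statement also lists `std` and names the argument `percentiles`. A variant that follows the statement more literally could look like the sketch below. The name `describe_x`, the `P...` labels and the seed are assumptions made here, not the course's reference solution.

```python
import numpy as np
import pandas as pd

def describe_x(s, percentiles=None):
    """Summary statistics of a Series, plus user-specified percentiles."""
    percentiles = [] if percentiles is None else list(percentiles)
    stats = pd.Series({
        'MIN': s.min(), 'MAX': s.max(), 'SUM': s.sum(),
        'MEAN': s.mean(), 'STD': s.std(),
        'MISSINGS': s.isnull().sum(), 'NONMISSINGS': s.notnull().sum(),
        'SKEW': s.skew(), 'KURT': s.kurtosis(),
    })
    quantiles = s.quantile(percentiles)
    quantiles.index = [f'P{p:g}' for p in percentiles]   # string labels, e.g. 'P0.1'
    return pd.concat([stats, quantiles])

rng = np.random.default_rng(0)                           # arbitrary seed
df_demo = pd.DataFrame(rng.normal(size=(10, 5)),
                       columns=[f'Col_{i}' for i in range(5)])

# One summary per column (default axis=0), then one summary per row (axis=1)
per_column = df_demo.apply(lambda c: describe_x(c, percentiles=[0.1, 0.9]))
per_row = df_demo.apply(lambda r: describe_x(r, percentiles=[0.1, 0.9]), axis=1)
print(per_column.shape, per_row.shape)
```

Relabelling the quantiles as strings keeps the index of the result homogeneous, which makes later lookups such as `per_column.loc['P0.1']` less error-prone than mixing string and float labels.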


### Task 10

- Report the shape of the Titanic DataFrame and the data type of each column.

- Subset the data to retain only the passengers that survived. Name this object 'survivors'
- Within the 'survivors' find the summary statistics of the 'Age' and 'Fare' variables.

- Subset the survivors data to retain only Female passengers.
- Find out how many females over the age of 30 survived.

- Are there any missing values in the Age column of the survivors DataFrame? Did this impact the previous question?
- Make a copy of the survivors DataFrame with the missing Age values replaced with the mean.
- Report the new summary statistics on Age. What happened? Why?
- Find out how many females over the age of 30 survived now. What changed? Does this make sense? What does this say about the previous result? What if the mean age were 31?
 

In [51]:
df_titanic = pd.read_csv("data/titanic.csv")

In [52]:
df_titanic.sample(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
277,278,0,2,"Parkes, Mr. Francis ""Frank""",male,,0,0,239853,0.0,,S
380,381,1,1,"Bidois, Miss. Rosalie",female,42.0,0,0,PC 17757,227.525,,C
120,121,0,2,"Hickman, Mr. Stanley George",male,21.0,2,0,S.O.C. 14879,73.5,,S


In [53]:
df_titanic.shape

(891, 12)

In [54]:
survivors=df_titanic[df_titanic.Survived==True]

In [55]:
len(survivors)

342

In [56]:
len(df_titanic)

891

In [57]:
survivors['Age'].describe()

count    290.000000
mean      28.343690
std       14.950952
min        0.420000
25%       19.000000
50%       28.000000
75%       36.000000
max       80.000000
Name: Age, dtype: float64

In [58]:
survivors['Fare'].describe()

count    342.000000
mean      48.395408
std       66.596998
min        0.000000
25%       12.475000
50%       26.000000
75%       57.000000
max      512.329200
Name: Fare, dtype: float64

In [59]:
female_survivors = survivors.loc[(survivors.Sex == 'female')]
female_survivors

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
22,23,1,3,"McGowan, Miss. Anna ""Annie""",female,15.0,0,0,330923,8.0292,,Q


In [60]:
len(survivors.loc[(survivors.Sex == 'female') & (survivors.Age > 30)].index)

83

Or, equivalently, starting from the full DataFrame:

In [63]:
df_titanic.loc[(df_titanic.Sex == 'female') & (df_titanic.Age > 30), 'Survived'].sum()

83

In [78]:
survivors['Age'].unique()

array([38.  , 26.  , 35.  , 27.  , 14.  ,  4.  , 58.  , 55.  ,   nan,
       34.  , 15.  , 28.  ,  3.  , 19.  , 49.  , 29.  , 21.  ,  5.  ,
       17.  , 32.  ,  0.83, 30.  , 33.  , 23.  , 32.5 , 12.  , 24.  ,
       22.  , 16.  , 40.  ,  9.  ,  1.  , 45.  , 44.  , 18.  , 31.  ,
        8.  , 37.  , 50.  , 25.  , 41.  , 63.  , 42.  ,  0.92, 36.  ,
        2.  , 60.  , 39.  , 13.  , 52.  , 48.  ,  0.75, 54.  ,  7.  ,
       62.  , 53.  , 20.  , 80.  , 56.  ,  6.  ,  0.67, 51.  , 43.  ,
       11.  ,  0.42, 47.  ])

In [75]:
# len() counts every row of the boolean mask, not just the True values
len(survivors['Age'].isnull())

342

In [76]:
# Same pitfall: the mask is as long as the column, so this is again the row count
len(survivors['Age'].notnull())

342

In [74]:
len(survivors['Age'])

342

In [79]:
survivors['Age'].notnull().sum()

290

In [64]:
survivors['Age'].isnull().sum()

52
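
The difference between the last few cells is easy to miss: `len()` of a boolean mask counts every row, whereas `.sum()` counts only the `True` entries. A toy illustration (the values and the name `toy_age` are made up):

```python
import numpy as np
import pandas as pd

toy_age = pd.Series([38.0, np.nan, 26.0, np.nan, 35.0])   # invented values

mask = toy_age.isnull()
n_rows = len(mask)             # length of the mask = number of rows, missing or not
n_missing = mask.sum()         # True counts as 1, so this is the number of missing values
n_present = toy_age.count()    # count() skips missing values directly
print(n_rows, n_missing, n_present)
```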

In [68]:
survivors_NoMisAge = survivors.copy()

In [69]:
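# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas versions
survivors_NoMisAge['Age'] = survivors_NoMisAge['Age'].fillna(survivors['Age'].mean())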

In [73]:
len(survivors_NoMisAge['Age'])

342

In [67]:
survivors_NoMisAge['Age'].isnull().sum()

0

In [81]:
survivors['Age'].describe()

count    290.000000
mean      28.343690
std       14.950952
min        0.420000
25%       19.000000
50%       28.000000
75%       36.000000
max       80.000000
Name: Age, dtype: float64

In [80]:
survivors_NoMisAge['Age'].describe()

count    342.000000
mean      28.343690
std       13.763871
min        0.420000
25%       21.000000
50%       28.343690
75%       35.000000
max       80.000000
Name: Age, dtype: float64

In [82]:
# Build both masks from the same DataFrame so that their indexes match
len(survivors_NoMisAge.loc[(survivors_NoMisAge.Sex == 'female') & (survivors_NoMisAge.Age > 30)].index)

83

In [83]:
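# Caution: this recomputes the count on the original df_titanic, so it does not reflect the imputation
df_titanic.loc[(df_titanic.Sex == 'female') & (df_titanic.Age > 30), 'Survived'].sum()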

83
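
To explore the closing question ("What if the mean age was 31?"), the sketch below uses two invented toy series rather than the Titanic data; the helper name `count_over_30` is also an assumption. Comparisons with `NaN` evaluate to `False`, so rows with a missing age are silently left out of any `Age > 30` count. After mean imputation they are all counted if the mean exceeds 30 and none are counted otherwise, while the standard deviation shrinks because the imputed values sit exactly at the mean.

```python
import numpy as np
import pandas as pd

def count_over_30(age):
    """Number of entries strictly greater than 30 (NaN compares as False)."""
    return int((age > 30).sum())

low_mean = pd.Series([20.0, 25.0, 36.0, np.nan, np.nan])    # observed mean = 27
high_mean = pd.Series([20.0, 33.0, 40.0, np.nan, np.nan])   # observed mean = 31

for name, age in [('low_mean', low_mean), ('high_mean', high_mean)]:
    filled = age.fillna(age.mean())
    print(name, count_over_30(age), count_over_30(filled), age.std(), filled.std())
```

In the Titanic case the survivors' mean age is about 28.3, which is why the count of 83 does not move after imputation.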