In this tutorial, we will use PySyft to study breast cancer data. Our scenario, briefly summarised in the picture above, has two main characters:

Rachel, Data Scientist:
Rachel is a data scientist and researcher working on a project that uses Machine Learning to study breast cancer data. To do so, Rachel would like to use the (non-public) “Breast Cancer Biomarker” dataset that has been made available on the Cancer Research Centre Datasite.

Owen, Data Owner:
Owen is a laboratory data manager in the Cancer Biomarker Research group. Owen is responsible for organising and curating the database of clinical data collected from anonymised patient samples. Due to legal and regulatory constraints, this dataset cannot be made publicly available, and no copy of it may leave the premises of the research centre. Nonetheless, Owen is very keen on allowing researchers to use the “Breast Cancer Biomarker” dataset in their projects, so Owen sets up a PySyft Datasite to host it. As Data Owner, Owen will need to:

- upload the data

- manage credentials and user profiles

- review any project proposal submitted by external data scientists.

## Part 1: Datasets and Assets

### 1.1. Launch a local development Datasite

In [1]:
import syft as sy

The syft.orchestra.launch function runs a special local Datasite server that is intended for development purposes only. Each server is identified by its unique name, which PySyft uses to restore the server's internal state after a restart. We pass reset=True so that any previously stored state is discarded and the server instance is initialised from scratch.

In [2]:
data_site = sy.orchestra.launch(name="cancer-research-centre", reset=True)
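
To make the role of the server name and of reset=True more concrete, here is a toy sketch of the general idea: state is stored under the server's name, restored on the next launch, and wiped when a reset is requested. This is not how PySyft is implemented; `launch_toy_server`, the JSON file layout and the `launch_count` field are all hypothetical and exist only for illustration.

```python
import json
import tempfile
from pathlib import Path


# Hypothetical helper, NOT part of PySyft: it only mimics the idea that a
# server's state is stored under its name and can be wiped on launch.
def launch_toy_server(name: str, storage_dir: Path, reset: bool = False) -> dict:
    state_file = storage_dir / f"{name}.json"
    if reset and state_file.exists():
        state_file.unlink()  # discard any previously saved state
    if state_file.exists():
        state = json.loads(state_file.read_text())  # restore earlier state
    else:
        state = {"name": name, "launch_count": 0}  # first-time initialisation
    state["launch_count"] += 1
    state_file.write_text(json.dumps(state))
    return state


storage = Path(tempfile.mkdtemp())
first = launch_toy_server("cancer-research-centre", storage)
second = launch_toy_server("cancer-research-centre", storage)  # state restored
fresh = launch_toy_server("cancer-research-centre", storage, reset=True)  # state wiped
print(first["launch_count"], second["launch_count"], fresh["launch_count"])
```

In this toy model the second launch picks up where the first left off, while the launch with reset=True starts over as if the name had never been used.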

In [3]:
# As a first step, Owen logs in to the Datasite using the default admin credentials.
client = data_site.login(email="info@openmined.org", password="changethis")

Logged into <cancer-research-centre: High side Datasite> as <info@openmined.org>


### 1.2. Downloading our example dataset

In [4]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17)

# data (as pandas DataFrames)
X = breast_cancer_wisconsin_diagnostic.data.features
y = breast_cancer_wisconsin_diagnostic.data.targets

# metadata
metadata = breast_cancer_wisconsin_diagnostic.metadata

# variable information
variables = breast_cancer_wisconsin_diagnostic.variables

In [5]:
X.head(n=5)  # n specifies how many rows we want in the preview

| | radius1 | texture1 | perimeter1 | area1 | smoothness1 | compactness1 | concavity1 | concave_points1 | symmetry1 | fractal_dimension1 | ... | radius3 | texture3 | perimeter3 | area3 | smoothness3 | compactness3 | concavity3 | concave_points3 | symmetry3 | fractal_dimension3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [6]:
X.shape

(569, 30)

In [7]:
y.sample(n=5, random_state=10)

| | Diagnosis |
| --- | --- |
| 172 | M |
| 553 | B |
| 374 | B |
| 370 | M |
| 419 | B |
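
The sample above contains two diagnosis labels, M and B. A quick way to see how often each label occurs is `value_counts`. The sketch below is only illustrative: `y_preview` is a stand-in built from the five sampled rows shown above, not the full `y` DataFrame, so the counts say nothing about the balance of the whole dataset.

```python
import pandas as pd

# Stand-in for `y`: only the five rows sampled above, not the full dataset.
y_preview = pd.DataFrame(
    {"Diagnosis": ["M", "B", "B", "M", "B"]},
    index=[172, 553, 374, 370, 419],
)

# Count how often each diagnosis label occurs in the preview.
label_counts = y_preview["Diagnosis"].value_counts()
print(label_counts.to_dict())
```

Running the same call on the real `y["Diagnosis"]` column would give the label counts for all 569 rows.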


### 1.3. Create Assets and Dataset

PySyft will host two versions of the data: first, the real data; second, mock data, that is, a fake version of the real data that data scientists can download and inspect.
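
One possible way to produce such mock data is sketched below. This is an assumption about how it could be done, not the method prescribed by PySyft: `make_mock_features` and `make_mock_labels` are hypothetical helpers, and `X_toy` / `y_toy` are tiny stand-ins (a few feature values from the preview above plus illustrative labels) used so the example runs on its own. The mock keeps the column names, index and shape of the real data but fills them with random values.

```python
import numpy as np
import pandas as pd


def make_mock_features(real: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Random numbers with the same columns, index and shape as `real`."""
    rng = np.random.default_rng(seed)
    mock = {}
    for column in real.columns:
        low, high = real[column].min(), real[column].max()
        # Uniform draws within the observed range; rows carry no real signal.
        mock[column] = rng.uniform(low, high, size=len(real))
    return pd.DataFrame(mock, index=real.index)


def make_mock_labels(real: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Random labels drawn from the distinct values seen in each column."""
    rng = np.random.default_rng(seed)
    mock = {}
    for column in real.columns:
        choices = list(real[column].unique())
        mock[column] = rng.choice(choices, size=len(real))
    return pd.DataFrame(mock, index=real.index)


# Toy stand-ins for X and y (labels here are illustrative only).
X_toy = pd.DataFrame({"radius1": [17.99, 20.57, 19.69], "texture1": [10.38, 17.77, 21.25]})
y_toy = pd.DataFrame({"Diagnosis": ["M", "B", "B"]})

X_mock = make_mock_features(X_toy)
y_mock = make_mock_labels(y_toy)
print(X_mock.shape, list(X_mock.columns))
```

Note that this sketch still reveals each column's minimum and maximum and the set of label values; whether that is acceptable is a decision for the data owner.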