# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 0: Prepare the environment and import data from S3
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Before we continue, install the required Python packages.
import sys

!{sys.executable} -m pip install boto3
!{sys.executable} -m pip install s3fs
!{sys.executable} -m pip install pyspark
!{sys.executable} -m pip install cqlsh
!{sys.executable} -m pip install findspark
!{sys.executable} -m pip install pyarrow

You should consider upgrading via the 'pip install --upgrade pip' command.


In [1]:
# Do all imports and installs here
import configparser
import pandas as pd
import os
import boto3
import uuid
from pyspark.sql import types as T
from time import sleep

In [2]:
config = configparser.ConfigParser()
config.read('iam.cfg')
os.environ['AWS_ACCESS_KEY_ID']=config['AWS_CREDS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS_CREDS']['AWS_SECRET_ACCESS_KEY']

client=boto3.client('s3')


# # Set spark environments
# os.environ['PYSPARK_PYTHON'] = '/usr/local/bin/python3'
# os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/local/bin/python3'

### Scope the Project and Gather Data

#### Project description:

This project is divided into several parts and draws on all four datasets.

Before going into the details, we need to review the characteristics of relational and non-relational databases.

A relational DB is characterized by low redundancy and high completeness, which makes it well suited to small or medium-sized data that does not change often. In our case, the temperature, airport code and US city demographic data should be stored in a relational database that meets 3NF, because these datasets change rarely and their volume is modest.

The final solution will work as a database management system. When a user inputs a time or time period and the column they are interested in (e.g., visa type), the system returns the related data as a dataframe. For example, a user may want to know which airports were busiest for investment visa holders (E-1 visa) in 2016, together with basic information such as the temperature and the profile of the city's population (age, majority race, etc.), or when arrivals of international students in the United States peak and where those students come from.
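
As a rough illustration of the kind of lookup described above, the sketch below counts arrivals per port for one visa type and year. It uses plain Python on a few hypothetical rows; the port codes, the sample data and the `busiest_ports` helper are assumptions made for illustration, not part of the project code.

```python
from collections import Counter

# Hypothetical sample rows; the real pipeline would read these from the i94 table.
sample_arrivals = [
    {"port": "NYC", "visatype": "E1", "year": 2016},
    {"port": "NYC", "visatype": "E1", "year": 2016},
    {"port": "LOS", "visatype": "E1", "year": 2016},
    {"port": "LOS", "visatype": "F1", "year": 2016},
]


def busiest_ports(rows, visatype, year, top_n=3):
    """Return (port, arrivals) pairs for one visa type and year, busiest first."""
    counts = Counter(
        row["port"]
        for row in rows
        if row["visatype"] == visatype and row["year"] == year
    )
    return counts.most_common(top_n)


print(busiest_ports(sample_arrivals, "E1", 2016))
```

The same aggregation could be expressed as a `GROUP BY port` query in Spark SQL once the i94 view is registered.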

* Data will be imported from Amazon S3
* The non-relational DB will be implemented on Amazon Keyspaces, and data backups will be stored on S3 in Parquet format.
* Data cleaning and the ETL process will be implemented on Amazon EMR with Spark

#### The datasets used in this project are:

* I94 Immigration Data: This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace. https://travel.trade.gov/research/reports/i94/historical/2016.html is where the data comes from. There's a sample file so you can take a look at the data in csv format before reading it all in. You do not have to use the entire dataset, just use what you need to accomplish the goal you set at the beginning of the project.
* World Temperature Data: This dataset came from Kaggle. You can read more about it here: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data.
* U.S. City Demographic Data: This data comes from OpenDataSoft. You can read more about it here: https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/.
* Airport Code Table: This is a simple table of airport codes and corresponding cities. It comes from here: https://datahub.io/core/airport-codes#data.

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data
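
A minimal sketch of how missing values and duplicate rows could be summarized with pandas before cleaning. The `profile_quality` helper and the sample rows are hypothetical; only the column names are taken from the airport codes table.

```python
import pandas as pd


def profile_quality(df):
    """Summarize missing values per column and count fully duplicated rows."""
    missing = df.isna().sum()
    duplicates = int(df.duplicated().sum())
    return missing, duplicates


# Hypothetical rows shaped like the airport codes table.
sample = pd.DataFrame({
    "ident": ["00A", "KJFK", "KJFK"],
    "iata_code": [None, "JFK", "JFK"],
    "local_code": ["00A", "JFK", "JFK"],
})

missing, duplicates = profile_quality(sample)
print(missing.to_dict())
print(duplicates)
```

Columns with many missing values (such as `iata_code` in this toy sample) are candidates for filtering or for being combined with another column, which is what the airport query does with `local_code`.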

In [8]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql import SQLContext
from pyspark.sql import types as T
from pyspark.sql.types import *
from pyspark import SparkContext

spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.config("spark.hadoop.fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.driver.memory", "15g")\
.enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

In [4]:
i94 = pd.read_sas('i94_jan16_sub.sas7bdat', 'sas7bdat',encoding="ISO-8859-1").drop_duplicates()
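# Assign the IDs as a plain list: after drop_duplicates() the index can have gaps,
# and a pd.Series would be aligned by index, leaving some rows without an ID.
i94['id_'] = [str(uuid.uuid1()) for _ in range(len(i94))]
i94['arrival_date'] = pd.to_timedelta(i94['arrdate'],unit='D') + pd.Timestamp('1960-1-1')
i94=spark.createDataFrame(i94)
# Register the view that the query below reads from.
i94.createOrReplaceTempView('i94')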
sql="""SELECT i94yr AS year,i94mon AS month,i94cit AS citizenship,
              i94res AS resident,i94port AS port,
              arrival_date,i94mode AS mode,
              i94addr AS us_state,depdate AS depart_date,
              i94bir AS age,i94visa visa_category,
              dtadfile AS date_added,visapost AS visa_issued_by,
              occup AS occupation,entdepa AS arrival_flag,
              entdepd AS depart_flag,entdepu AS update_flag,
              matflag AS match_arrival_depart_flag,
              biryear AS birth_year,dtaddto AS allowed_date,
              gender,insnum AS ins_number,airline,
              admnum AS admission_number,
              fltno AS flight_no,visatype,id_
              FROM i94;
       """
i94_df=spark.sql(sql)

In [503]:
airport_codes_url = 's3://srk-data-eng-capstone/airport-codes_csv.csv'
us_city_demographics_url = 's3://srk-data-eng-capstone/us-cities-demographics.csv'

In [91]:
airport_codes = pd.read_csv(airport_codes_url)
airport_codes = spark.createDataFrame(airport_codes)
airport_codes.createOrReplaceTempView('airports')
sql = """SELECT ident, type, name, elevation_ft, continent, 
                iso_country, iso_region, municipality, gps_code, iata_code AS airport_code, coordinates
         FROM airports WHERE iata_code IS NOT NULL
         UNION
         SELECT ident, type, name, elevation_ft, continent,
                iso_country, iso_region, municipality, gps_code, local_code AS airport_code, coordinates
         FROM airports WHERE local_code IS NOT NULL"""
airports = spark.sql(sql)

In [504]:
us_city_demographics=pd.read_csv(us_city_demographics_url, sep=';')
us_city_demographics=spark.createDataFrame(us_city_demographics)
us_city_demographics.createOrReplaceTempView('us_cities')
sql="""SELECT city, `Median Age` AS median_age, `Male Population` AS male_population,
              `Female Population` AS female_population, `Total Population` AS population,
              `Number of Veterans` AS num_veterans, `Foreign-born` AS foreign_born, `Average Household Size` AS avg_household_size,
              `State Code` AS state, race, count
       FROM us_cities"""
us_cities = spark.sql(sql)

In [None]:
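def mapping_processor(names):
    """Build a two-column Spark dataframe from mappings/<names>.txt."""
    code=[]
    name=[]
    # The with-block closes the file even if parsing fails.
    with open('mappings/{}.txt'.format(names),'r') as origin:
        for each in origin:
            line=" ".join(each.split())
            try:
                # Numeric codes are stored as integers.
                code.append(int(line[:line.index('=')]))
            except ValueError:
                # Non-numeric codes: drop the first and last character around the code.
                code.append(line[1:line.index('=')-1])
            name.append(line[line.index('=')+2:-1])
    col_code=names+'_code'
    col_name=names+'_name'
    df=pd.DataFrame(list(zip(code,name)),columns=[col_code,col_name])
    df=spark.createDataFrame(df)
    return df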

country=mapping_processor('country')
mode=mapping_processor('mode')
port=mapping_processor('port')
us_states=mapping_processor('us_states')
visacode=mapping_processor('visacode')

In [581]:
country.createOrReplaceTempView('country')
mode.createOrReplaceTempView('mode')
port.createOrReplaceTempView('port')
us_states.createOrReplaceTempView('us_states')
visacode.createOrReplaceTempView('visacode')
i94_df.createOrReplaceTempView('i94')
airports.createOrReplaceTempView('airports')
us_cities.createOrReplaceTempView('us_cities')

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
After analysing all columns from all datasets, I found that the temperature table is very hard to connect to the rest of the tables. Hence, the model only uses the airport and US city demographics tables alongside the I94 data. In addition, based on the I94 metadata (https://github.com/dai-dao/udacity-data-engineering-capstone/blob/master/I94_SAS_Labels_Descriptions.SAS), we create mapping data for country, US states, visa code and the way of crossing the US border (mode).
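
To illustrate how one line of such a label file could be turned into a mapping row, here is a small sketch. It assumes each line has the form `code = 'name'`, with quotes around text codes; the `parse_mapping_line` helper is hypothetical and is not the function used in this notebook.

```python
def parse_mapping_line(line):
    """Split one "code = 'name'" line into a (code, name) pair.

    Assumes the quoted label format described above; quotes and surrounding
    whitespace are stripped, and purely numeric codes become ints.
    """
    code, _, name = line.partition("=")
    code = code.strip().strip("'")
    name = name.strip().rstrip(";").strip().strip("'").strip()
    return (int(code) if code.isdigit() else code), name


print(parse_mapping_line("   582 =  'MEXICO'"))
print(parse_mapping_line("'AL'='ALABAMA'"))
```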



#### 3.2 Mapping Out Data Pipelines
* Read the SAS files into a pandas dataframe and add an ID to every row (one row represents one cross-border report).
* Convert the SAS date format to datetime format.
* Rename the original columns to names that are easier to understand.
* Combine the local airport code and the IATA airport code into a single airport code column.
* Convert the US city demographics to a Spark dataframe and rename its columns to readable names.
* Define a function that generates the mapping tables.
* Create a temporary view for every table.
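
The date conversion step relies on SAS storing dates as a number of days since 1960-01-01. A minimal standard-library sketch of that conversion follows; the `sas_days_to_date` helper is hypothetical, and the notebook itself performs the same arithmetic with `pd.to_timedelta`.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)


def sas_days_to_date(days):
    """Convert a SAS date (days since 1960-01-01) to a date, or None if missing."""
    if days is None or days != days:  # NaN is the only value not equal to itself
        return None
    return SAS_EPOCH + timedelta(days=int(days))


print(sas_days_to_date(20454.0))
```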

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [19]:
import configparser
import pandas as pd
import os
import uuid

from pyspark.sql import SparkSession

def spark_generator():
    spark = SparkSession.builder.\
    config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")\
    .config("spark.hadoop.fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.driver.memory", "15g")\
    .enableHiveSupport().getOrCreate()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    return spark

def immigration_data(year, month):
    i94 = pd.read_sas(('i94_'+str(month)+str(year)+'_sub.sas7bdat'), 'sas7bdat',encoding="ISO-8859-1").drop_duplicates()
    # Assign as a plain list so the IDs are not index-aligned after drop_duplicates().
    i94['id_'] = [str(uuid.uuid1()) for _ in range(len(i94))]
    i94['arrival_date'] = pd.to_timedelta(i94['arrdate'],unit='D') + pd.Timestamp('1960-1-1')
    i94=spark.createDataFrame(i94)
    i94.createOrReplaceTempView('i94')
    sql="""SELECT i94yr AS year,i94mon AS month,i94cit AS citizenship,
              i94res AS resident,i94port AS port,
              arrival_date,i94mode AS mode,
              i94addr AS us_state,depdate AS depart_date,
              i94bir AS age,i94visa visa_category,
              dtadfile AS date_added,visapost AS visa_issued_by,
              occup AS occupation,entdepa AS arrival_flag,
              entdepd AS depart_flag,entdepu AS update_flag,
              matflag AS match_arrival_depart_flag,
              biryear AS birth_year,dtaddto AS allowed_date,
              gender,insnum AS ins_number,airline,
              admnum AS admission_number,
              fltno AS flight_no,visatype,id_
              FROM i94;
       """
    i94_df=spark.sql(sql)
    i94_df.write.mode('overwrite')\
           .partitionBy('month','year')\
           .format('parquet')\
            .option("compression", "gzip")\
            .save('parquet_data/'+str(month) + '_' + str(year) + '.parquet')
    print('i94 parquet generation complete.-' + str(month) + '_' + str(year))

def airport():
    airport_codes_url = 's3://srk-data-eng-capstone/airport-codes_csv.csv'
    airport_codes = pd.read_csv(airport_codes_url)
    airport_codes = spark.createDataFrame(airport_codes)
    airport_codes.createOrReplaceTempView('airports')
    sql = """SELECT ident, type, name, elevation_ft, continent, 
                iso_country, iso_region, municipality, gps_code, iata_code AS airport_code, coordinates
         FROM airports WHERE iata_code IS NOT NULL
         UNION
         SELECT ident, type, name, elevation_ft, continent,
                iso_country, iso_region, municipality, gps_code, local_code AS airport_code, coordinates
         FROM airports WHERE local_code IS NOT NULL"""
    airports = spark.sql(sql)
    airports.write.mode('overwrite')\
            .format('parquet')\
            .option("compression", "gzip")\
            .save('parquet_data/airports.parquet')
    print('Airport parquet generation complete.')
    
def us_cities():
    us_city_demographics_url = 's3://srk-data-eng-capstone/us-cities-demographics.csv'
    us_city_demographics=pd.read_csv(us_city_demographics_url, sep=';')
    us_city_demographics=spark.createDataFrame(us_city_demographics)
    us_city_demographics.createOrReplaceTempView('us_cities')
    sql="""SELECT city, `Median Age` AS median_age, `Male Population` AS male_population,
              `Female Population` AS female_population, `Total Population` AS population,
              `Number of Veterans` AS num_veterans, `Foreign-born` AS foreign_born, `Average Household Size` AS avg_household_size,
              `State Code` AS state, race, count
       FROM us_cities"""
    us_cities = spark.sql(sql)
    us_cities.write.mode('overwrite')\
             .format('parquet')\
             .option('compression','gzip')\
             .save('parquet_data/us_cities.parquet')
    print('US cities parquet generation complete.')
    
def mapping(names):
    origin=open('mappings/{}.txt'.format(names),'r')
    code=[]
    name=[]
    for each in origin:
        line=" ".join(each.split())
        try:
            code.append(int(line[:line.index('=')]))
        except:
            code.append(line[1:line.index('=')-1])
        name.append(line[line.index('=')+2:-1])
    origin.close()
    col_code=names+'_code'
    col_name=names+'_name'
    df=pd.DataFrame(list(zip(code,name)),columns=[col_code,col_name])
    df=spark.createDataFrame(df)
    df.write.mode('overwrite')\
      .format('parquet')\
        .option('compression','gzip')\
        .save('parquet_data/' + names + '.parquet')
    print(names + ' parquet generation complete.')

def upload_files(filename):
    config = configparser.ConfigParser()
    config.read('iam.cfg')
    os.environ['AWS_ACCESS_KEY_ID']=config['AWS_CREDS']['AWS_ACCESS_KEY_ID']
    os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS_CREDS']['AWS_SECRET_ACCESS_KEY']
    
    os.system('aws s3 cp parquet_data/{}.parquet s3://i94-backup --recursive'.format(filename))
    print(filename + ' is uploaded to bucket i94-backup')
    

In [20]:
spark=spark_generator()
immigration_data(16,'jan')
airport()
us_cities()
mapping_list=['country','us_states','visacode','mode']
for each in mapping_list:
    mapping(each)
uploading_list=['jan_16','airports','country','mode','us_cities','us_states','visacode']
for each in uploading_list:
    upload_files(each)

i94 parquet generation complete.-jan_16
Airport parquet generation complete.
US cities parquet generation complete.
country parquet generation complete.
us_states parquet generation complete.
visacode parquet generation complete.
mode parquet generation complete.
jan_16 is uploaded to bucket i94-backup
airports is uploaded to bucket i94-backup
country is uploaded to bucket i94-backup
mode is uploaded to bucket i94-backup
us_cities is uploaded to bucket i94-backup
us_states is uploaded to bucket i94-backup
visacode is uploaded to bucket i94-backup


#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [11]:
spark = spark_generator()
jan_16 = spark.read.parquet('parquet_data/jan_16.parquet')
jan_16.createOrReplaceTempView('i94')
sql = """SELECT DISTINCT id_ FROM i94"""
temp = spark.sql(sql)
sql = """SELECT id_ FROM i94"""
temp2 = spark.sql(sql)
print(temp.count())
print(temp2.count())

2847924
2847924


In [4]:
airport = spark.read.parquet('parquet_data/airports.parquet')
airport.createOrReplaceTempView('airports')
sql = """SELECT airport_code FROM airports WHERE airport_code IS NULL"""
temp = spark.sql(sql)
sql = """SELECT ident FROM airports WHERE ident IS NULL"""
temp2 = spark.sql(sql)
sql = """SELECT ident, airport_code FROM airports"""
temp3 = spark.sql(sql)
sql = """SELECT * FROM airports"""
temp4 = spark.sql(sql)
print(temp.count())
print(temp2.count())
print(temp3.count())
print(temp4.count())


0
0
36100
36100


In [5]:
us_cities = spark.read.parquet('parquet_data/us_cities.parquet')
us_cities.createOrReplaceTempView('us_cities')
sql = """SELECT city FROM us_cities WHERE city IS NULL"""
temp = spark.sql(sql)
sql = """SELECT state FROM us_cities WHERE state IS NULL"""
temp2 = spark.sql(sql)
sql = """SELECT * FROM us_cities"""
temp3 = spark.sql(sql)
sql = """SELECT city, state FROM us_cities"""
temp4 = spark.sql(sql)
print(temp.count())
print(temp2.count())
print(temp3.count())
print(temp4.count())

0
0
2891
2891
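
The count comparisons above could also be expressed as a reusable check that fails loudly. The sketch below works on plain Python rows and is only an illustration; the `assert_unique_key` helper and the sample rows are hypothetical.

```python
def assert_unique_key(rows, key_columns):
    """Raise ValueError if any key part is missing or any key repeats.

    Returns the number of distinct keys when every row passes.
    """
    seen = set()
    for position, row in enumerate(rows):
        key = tuple(row.get(col) for col in key_columns)
        if any(part is None for part in key):
            raise ValueError(
                "row {}: missing value in key {}".format(position, key_columns)
            )
        if key in seen:
            raise ValueError("row {}: duplicate key {}".format(position, key))
        seen.add(key)
    return len(seen)


# Hypothetical rows; in the pipeline these would come from the parquet output.
rows = [{"id_": "a1"}, {"id_": "b2"}, {"id_": "c3"}]
print(assert_unique_key(rows, ["id_"]))
```

Raising an exception instead of printing counts lets a scheduled run stop before a bad extract is uploaded.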


#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

<img src="Pasted Graphic 3.png">

In [16]:
print('i94 data. comes from daily record in each port:')
jan_16.printSchema()
print('airport data: ')
airport.printSchema()
print('US cities data: ')
us_cities.printSchema()
print('visacode,us_states and mode view are mapping view.')

i94 data. comes from daily record in each port:
root
 |-- citizenship: double (nullable = true)
 |-- resident: double (nullable = true)
 |-- port: string (nullable = true)
 |-- arrival_date: timestamp (nullable = true)
 |-- mode: double (nullable = true)
 |-- us_state: string (nullable = true)
 |-- depart_date: double (nullable = true)
 |-- age: double (nullable = true)
 |-- visa_category: double (nullable = true)
 |-- date_added: string (nullable = true)
 |-- visa_issued_by: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- arrival_flag: string (nullable = true)
 |-- depart_flag: string (nullable = true)
 |-- update_flag: string (nullable = true)
 |-- match_arrival_depart_flag: string (nullable = true)
 |-- birth_year: double (nullable = true)
 |-- allowed_date: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- ins_number: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admission_number: double (nullable = true)
 |-- flight_

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.

If the data were increased by 100x, we should use an EMR cluster with more nodes, more powerful node types, or both.
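
Alongside a larger cluster, one way to keep each job bounded is to process one monthly file at a time. The sketch below only generates the file names, following the `i94_<month><year>_sub.sas7bdat` pattern used for January; the month abbreviations for the other months and the `monthly_batches` helper are assumptions.

```python
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def monthly_batches(year):
    """Yield (month, file name) pairs so each month can be processed as its own job."""
    for month in MONTHS:
        yield month, "i94_{}{}_sub.sas7bdat".format(month, year)


for month, filename in monthly_batches(16):
    print(month, filename)
```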

The data populates a dashboard that must be updated on a daily basis by 7am every day.

In this situation, after 7am, we can import the data into a NoSQL database as shown below:

In [None]:
from cassandra.cluster import Cluster
from ssl import SSLContext, PROTOCOL_TLSv1_2, CERT_REQUIRED
from cassandra.auth import PlainTextAuthProvider
from cassandra import ConsistencyLevel

# TLS 1.0 is deprecated, so TLS 1.2 is requested here instead.
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations('AmazonRootCA1.pem')
ssl_context.verify_mode = CERT_REQUIRED
auth_provider = PlainTextAuthProvider(username=str(config['APACHE_CASSANDRA_CREDS']['CASSANDRA_USERNAME']), password=str(config['APACHE_CASSANDRA_CREDS']['CASSANDRA_PASSWORD']))
cluster = Cluster(['cassandra.eu-west-1.amazonaws.com'], ssl_context=ssl_context, auth_provider=auth_provider, port=9142)
print('Patient...')
session = cluster.connect()

create_keyspace="""CREATE KEYSPACE IF NOT EXISTS "i94"
                   WITH REPLICATION={'class':'SingleRegionStrategy'}"""
session.execute(create_keyspace)
sleep(10)

# Column names follow the raw I94 field names so that they match the
# INSERT and SELECT statements in the next cell.
create_table="""CREATE TABLE IF NOT EXISTS "i94".i94 (
                                                      cicid DOUBLE,
                                                      i94yr DOUBLE,
                                                      i94mon DOUBLE,
                                                      i94cit DOUBLE,
                                                      i94res DOUBLE,
                                                      i94port TEXT,
                                                      arrdate DOUBLE,
                                                      i94mode DOUBLE,
                                                      i94addr TEXT,
                                                      depdate DOUBLE,
                                                      i94bir DOUBLE,
                                                      i94visa DOUBLE,
                                                      "count" DOUBLE,
                                                      dtadfile DOUBLE,
                                                      visapost TEXT,
                                                      occup TEXT,
                                                      entdepa TEXT,
                                                      entdepd TEXT,
                                                      entdepu TEXT,
                                                      matflag TEXT,
                                                      biryear DOUBLE,
                                                      dtaddto TEXT,
                                                      gender TEXT,
                                                      insnum TEXT,
                                                      airline TEXT,
                                                      admnum DOUBLE,
                                                      fltno TEXT,
                                                      visatype TEXT,
                                                      id_ TEXT,
                                                      PRIMARY KEY(id_)
                ) """
session.execute(create_table)
sleep(10)
print('Table well-prepared. you can input data from dataset.')

A non-relational DB is characterized by higher elasticity, faster reads and writes, and the ability to handle evolving data volumes. In our case, the I94 data should be saved into a non-relational DB, because in a real-world setting this data would go through the ETL process almost every minute, and it needs dynamic reads and writes for real-time data monitoring.
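
For frequent writes, values are usually bound to a statement rather than formatted into the query text. A minimal sketch of building such a statement is shown here; the `build_insert` helper is hypothetical and only produces the statement text and the values, without talking to a database.

```python
def build_insert(keyspace, table, record):
    """Return a CQL INSERT with one '?' marker per column, plus the bound values."""
    columns = list(record)
    column_list = ", ".join('"{}"'.format(name) for name in columns)
    markers = ", ".join("?" for _ in columns)
    statement = 'INSERT INTO "{}".{} ({}) VALUES ({})'.format(
        keyspace, table, column_list, markers
    )
    return statement, tuple(record[name] for name in columns)


# Hypothetical two-column record, used only to show the generated statement.
statement, values = build_insert("i94", "i94", {"id_": "abc", "i94yr": 2016.0})
print(statement)
print(values)
```

With the DataStax Python driver, a statement of this shape can be passed to `session.prepare(...)` and then executed with `session.execute(prepared, values)`, which avoids quoting problems when a value contains an apostrophe.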

In [None]:
original_sql="""INSERT INTO "i94".i94 ("cicid","i94yr","i94mon","i94cit","i94res","i94port","arrdate","i94mode","i94addr","depdate",
                              "i94bir","i94visa","count","dtadfile","visapost","occup","entdepa","entdepd","entdepu","matflag",
                              "biryear","dtaddto","gender","insnum","airline","admnum","fltno","visatype","id_")
                              VALUES ({0},{1},{2},{3},{4},'{5}',{6},{7},'{8}',{9},
                                      {10},{11},{12},{13},'{14}','{15}','{16}','{17}','{18}','{19}',
                                      {20},'{21}','{22}','{23}','{24}',{25},'{26}','{27}','{28}')"""

lists=[888,1991,10,999,666,'port_test',9527,777,'addr_test',10,10,10,10,10,"visapost",'occup','entdepa','entdepd',
      'entdepu','mat',1984,'dtaddto','M','insnumber','AerLingus',29,'filtnumber','H1B']
sql=original_sql.format(lists[0],lists[1],lists[2],lists[3],lists[4],lists[5],lists[6],lists[7],lists[8],lists[9],
                       lists[10],lists[11],lists[12],lists[13],lists[14],lists[15],lists[16],lists[17],lists[18],lists[19],
                       lists[20],lists[21],lists[22],lists[23],lists[24],lists[25],lists[26],lists[27],uuid.uuid1())
sql=session.prepare(sql)
sql.consistency_level = ConsistencyLevel.LOCAL_QUORUM
session.execute(sql)

# This part is intended for the standalone script version.

# while True:
#     values=input("Insert data. Split values by comma. If a value is empty, just input a comma. Enter Q to quit. ")
#     if values.strip().lower() == 'q':
#         print('Quit.')
#         break
#     lists=values.split(',')
#     if len(lists) < 28:
#         print('Some values are missing.')
#         continue
#     statement=session.prepare(original_sql.format(*lists[:28], uuid.uuid1()))
#     statement.consistency_level = ConsistencyLevel.LOCAL_QUORUM
#     session.execute(statement)
#     next_one=input('Done. Do you wish to continue? Y/N ')
#     if next_one.strip().lower() != 'y':
#         print('Thanks. Quit.')
#         break


temp = session.execute('SELECT * FROM i94.i94')
df = pd.DataFrame(temp, columns=['id_','admnum','airline','arrdate','biryear','cicid','count','depdate','dtadtto','dtadfile','entdepa','entdepd','entdepu','fltno','gender','i94addr','i94bir','i94cit','i94mode','i94mon','i94port','i94res','i94visa','i94yr','insnum','matflag','occup','visapost','visatype'])
# Write a single gzip-compressed Parquet file and upload that same file.
df.to_parquet('parquet_data/dashboard.parquet',compression='gzip')
os.system('aws s3 cp parquet_data/dashboard.parquet s3://i94-backup/dashboard.parquet')

We run the script above at 7AM every day to create a Parquet file and upload it to S3.
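
A small sketch of how the next daily run time could be computed if the export were driven from Python rather than from an external scheduler such as cron. The `next_run` helper and its default of 07:00 are assumptions for illustration.

```python
from datetime import datetime, time, timedelta


def next_run(now, run_at=time(7, 0)):
    """Return the next datetime at which the daily export should start."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate


print(next_run(datetime(2016, 1, 1, 6, 30)))
print(next_run(datetime(2016, 1, 1, 7, 0)))
```

If the dashboard must already be refreshed by 7am, `run_at` would need to be set early enough for the export and upload to finish before then.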

The database needed to be accessed by 100+ people.

Firstly, these 100+ people should have different 