# Basic operations

## Print "Hello world"

In [1]:
println("Hello world")

Hello world


## Create a vector <em>A</em> with three elements: 1, 2, 3

In [2]:
A = [1, 2, 3]

3-element Array{Int64,1}:
 1
 2
 3

In [3]:
[1, 2, 3]

3-element Array{Int64,1}:
 1
 2
 3

In [4]:
A = [1 2 3] # Spaces instead of commas create a 1×3 matrix (a single row), not a vector

1×3 Array{Int64,2}:
 1  2  3
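
The comma and space forms produce different types, which is easy to miss. A minimal sketch of the difference; the names `v` and `row` are illustrative and not part of the notebook:

```julia
v = [1, 2, 3]      # commas: a 3-element vector (1-dimensional)
row = [1 2 3]      # spaces: a 1×3 matrix (2-dimensional)

println(ndims(v))        # number of dimensions of v
println(size(row))       # dimensions of row as a tuple

# vec flattens the 1×3 matrix back into a vector
println(vec(row) == v)
```

Linear indexing such as `row[1]` still works on the matrix form, which is why the next cell returns the same value either way.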

## Access the first element of your vector

In [5]:
A[1] # Julia is 1-indexed

1
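
Beyond the first element, a few other indexing forms are worth knowing. This is a small sketch on a made-up vector `v`, not data from the notebook:

```julia
v = [10, 20, 30]

println(v[1])       # first element (indices start at 1)
println(v[end])     # last element
println(v[2:3])     # a slice, which returns a new vector
println(first(v), " ", last(v))   # function forms of v[1] and v[end]
```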

## Create a 2×3 matrix <em>B</em>: 1, 2, 3 in the first row and 4, 5, 6 in the second

In [6]:
B = [1 2 3 ; 4 5 6] # A semicolon starts a new row, a space separates columns

2×3 Array{Int64,2}:
 1  2  3
 4  5  6
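
Matrix elements are addressed by row and then column. A short sketch using the same values as `B`:

```julia
B = [1 2 3; 4 5 6]

println(size(B))     # (rows, columns)
println(B[2, 3])     # element in row 2, column 3
println(B[1, :])     # the whole first row, returned as a vector
println(B[:, 2])     # the whole second column
```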

## Transpose matrix B

In [7]:
B'

3×2 LinearAlgebra.Adjoint{Int64,Array{Int64,2}}:
 1  4
 2  5
 3  6
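
The output type above is an `Adjoint` wrapper rather than a plain array, because `B'` is lazy. If a standalone array is needed, `permutedims` is one way to get it. A minimal sketch; the names `Bt` and `Bc` are illustrative:

```julia
B = [1 2 3; 4 5 6]

Bt = B'                 # lazy wrapper that shares memory with B
Bc = permutedims(B)     # allocates a new, plain 3×2 array

println(Bt == Bc)       # both hold the same values
println(size(Bc))
```

For real-valued matrices the adjoint and the transpose coincide; for complex matrices `'` also conjugates the entries, so `transpose(B)` or `permutedims(B)` is the safer choice there.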

## Create a loop to calculate factorial 5 (5!) and print the result

In [8]:
fact = 1
for i in 1:5
    fact = fact * i
end
print(fact)

120
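
The loop is a good exercise, but the same result is available without one. A brief sketch of two built-in alternatives:

```julia
println(prod(1:5))        # product of the range 1, 2, ..., 5
println(factorial(5))     # built-in factorial

# factorial(21) no longer fits in an Int64 and throws an error;
# a BigInt argument avoids the overflow
println(factorial(big(25)))
```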

## For numbers from -3 to 3, create an <em>if-else</em> statement that prints "X is positive" if the number is zero or greater and "X is negative" otherwise

In [9]:
for number in -3:3
    if number >= 0
        println("$number is positive")
    else
        println("$number is negative")
    end
end

-3 is negative
-2 is negative
-1 is negative
0 is positive
1 is positive
2 is positive
3 is positive
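
The loop above sends 0 to the "positive" branch because it tests `number >= 0`. If zero should be reported separately, an `elseif` branch handles it. A possible sketch; `describe_sign` is a hypothetical helper name:

```julia
# Hypothetical helper: classify a number as positive, negative or zero
function describe_sign(n)
    if n > 0
        return "$n is positive"
    elseif n < 0
        return "$n is negative"
    else
        return "$n is zero"
    end
end

for number in -1:1
    println(describe_sign(number))
end
```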


# DataFrames and Machine Learning

## Create a dataframe using the DataFrames package

### With <em>using</em>, import <em>Pkg</em>, Julia's built-in package manager, which handles operations such as installing, updating and removing packages.

In [10]:
using Pkg

### Using <em>Pkg</em>, install the <em>DataFrames</em> package

In [11]:
Pkg.add("DataFrames")

[32m[1m  Updating[22m[39m registry at `~/.julia/registries/General`
[32m[1m  Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[?25l[2K[?25h[32m[1m Resolving[22m[39m package versions...
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Project.toml`
[90m [no changes][39m
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Manifest.toml`
[90m [no changes][39m


### Create dataframe <em>Capitals</em> with two columns: <em>City</em> containing <em>Paris, Warsaw</em> and <em>Country</em> containing <em>France, Poland</em>

In [12]:
using DataFrames # Pkg.add only installs the package; using loads it
Capitals = DataFrame(City = ["Paris", "Warsaw"], Country = ["France", "Poland"])

### Sort dataframe by Country

In [13]:
sort!(Capitals, :Country)

### Access column Country

In [14]:
Capitals.Country
# or
# Capitals[:Country]     (older DataFrames releases)
# Capitals[!, :Country]  (DataFrames 0.19 and later)

## Read CSV

### Set the directory containing the datasets (train.csv, test.csv) with the command <em>cd("your/working/directory")</em>

In [15]:
cd("/Users/sandrapietrowska/Documents/xke/julia/julia_data/")

In [16]:
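using DataFrames
using CSV
# Recent CSV.jl releases require the sink type as a second argument: CSV.read("train.csv", DataFrame)
train = CSV.read("train.csv")
test = CSV.read("test.csv")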

| Row | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome |
|---|---|---|---|---|---|---|---|
| | String | String⍰ | String | String⍰ | String | String⍰ | Int64 |
| 1 | LP001015 | Male | Yes | 0 | Graduate | No | 5720 |
| 2 | LP001022 | Male | Yes | 1 | Graduate | No | 3076 |
| 3 | LP001031 | Male | Yes | 2 | Graduate | No | 5000 |
| 4 | LP001035 | Male | Yes | 2 | Graduate | No | 2340 |
| 5 | LP001051 | Male | No | 0 | Not Graduate | No | 3276 |
| 6 | LP001054 | Male | Yes | 0 | Not Graduate | Yes | 2165 |
| 7 | LP001055 | Female | No | 1 | Not Graduate | No | 2226 |
| 8 | LP001056 | Male | Yes | 2 | Not Graduate | No | 3881 |
| 9 | LP001059 | Male | Yes | 2 | Graduate | missing | 13633 |
| 10 | LP001067 | Male | No | 0 | Not Graduate | No | 2400 |


### Use the <em>describe</em> function to see the basic statistics

In [17]:
describe(train)

| Row | variable | mean | min | median | max | nunique | nmissing |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Union… | Union… |
| 1 | Loan_ID | | LP001002 | | LP002990 | 614.0 | |
| 2 | Gender | | Female | | Male | 2.0 | 13.0 |
| 3 | Married | | No | | Yes | 2.0 | 3.0 |
| 4 | Dependents | | 0 | | 3+ | 4.0 | 15.0 |
| 5 | Education | | Graduate | | Not Graduate | 2.0 | |
| 6 | Self_Employed | | No | | Yes | 2.0 | 32.0 |
| 7 | ApplicantIncome | 5403.46 | 150 | 3812.5 | 81000 | | |
| 8 | CoapplicantIncome | 1621.25 | 0.0 | 1188.5 | 41667.0 | | |
| 9 | LoanAmount | 146.412 | 9 | 128.0 | 700 | | 22.0 |
| 10 | Loan_Amount_Term | 342.0 | 12 | 360.0 | 480 | | 14.0 |


### Use <em>names</em> function to see the column names

In [18]:
names(train)

13-element Array{Symbol,1}:
 :Loan_ID          
 :Gender           
 :Married          
 :Dependents       
 :Education        
 :Self_Employed    
 :ApplicantIncome  
 :CoapplicantIncome
 :LoanAmount       
 :Loan_Amount_Term 
 :Credit_History   
 :Property_Area    
 :Loan_Status      

### Use <em>size</em> function to see the size of train set

In [19]:
size(train)

(614, 13)

### Use <em>first</em> to display 10 rows

In [20]:
first(train, 10)

| Row | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome |
|---|---|---|---|---|---|---|---|
| | String | String⍰ | String⍰ | String⍰ | String | String⍰ | Int64 |
| 1 | LP001002 | Male | No | 0 | Graduate | No | 5849 |
| 2 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 |
| 3 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 |
| 4 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 |
| 5 | LP001008 | Male | No | 0 | Graduate | No | 6000 |
| 6 | LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 |
| 7 | LP001013 | Male | Yes | 0 | Not Graduate | No | 2333 |
| 8 | LP001014 | Male | Yes | 3+ | Graduate | No | 3036 |
| 9 | LP001018 | Male | Yes | 2 | Graduate | No | 4006 |
| 10 | LP001020 | Male | Yes | 1 | Graduate | No | 12841 |


### Calculate average income by gender using <em>by</em> function

In [21]:
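using Statistics # mean is defined in the Statistics standard library
# by was removed in DataFrames 1.0; there, use combine(groupby(train, :Gender), :ApplicantIncome => mean)
by(train, :Gender, :ApplicantIncome => mean)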
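
Conceptually, a grouped mean splits the rows by key, collects the values of each group, and reduces each group to one number. The sketch below reproduces that idea with the standard library only; the two vectors are made-up toy values standing in for the `Gender` and `ApplicantIncome` columns, not rows from the dataset:

```julia
using Statistics  # provides mean

# Toy stand-ins for the two columns (made-up values)
gender = ["Male", "Female", "Male", "Female"]
income = [4000, 3000, 6000, 5000]

# Split: collect the incomes that belong to each gender
groups = Dict{String,Vector{Int}}()
for (g, x) in zip(gender, income)
    push!(get!(groups, g, Int[]), x)
end

# Combine: reduce each group to its mean
group_means = Dict(g => mean(xs) for (g, xs) in groups)
println(group_means)
```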

### Drop missing values

In [22]:
train_nona = dropmissing(train)
test_nona = dropmissing(test)

| Row | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome |
|---|---|---|---|---|---|---|---|
| | String | String | String | String | String | String | Int64 |
| 1 | LP001015 | Male | Yes | 0 | Graduate | No | 5720 |
| 2 | LP001022 | Male | Yes | 1 | Graduate | No | 3076 |
| 3 | LP001031 | Male | Yes | 2 | Graduate | No | 5000 |
| 4 | LP001051 | Male | No | 0 | Not Graduate | No | 3276 |
| 5 | LP001054 | Male | Yes | 0 | Not Graduate | Yes | 2165 |
| 6 | LP001055 | Female | No | 1 | Not Graduate | No | 2226 |
| 7 | LP001056 | Male | Yes | 2 | Not Graduate | No | 3881 |
| 8 | LP001067 | Male | No | 0 | Not Graduate | No | 2400 |
| 9 | LP001078 | Male | No | 0 | Not Graduate | No | 3091 |
| 10 | LP001096 | Female | No | 0 | Graduate | No | 4666 |
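
The same idea applies to a single column: Julia represents absent values with `missing`, and Base provides tools to count and skip them. A small sketch on a made-up vector:

```julia
v = [1, missing, 3, missing, 5]

println(count(ismissing, v))       # how many entries are missing
println(ismissing(sum(v)))         # a plain sum propagates missing

clean = collect(skipmissing(v))    # copy of v without the missing entries
println(clean)
println(sum(skipmissing(v)))       # sum over the present values only
```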


## Label encoder

In [23]:
Pkg.add("ScikitLearn")
using ScikitLearn 
@sk_import preprocessing: LabelEncoder 

[32m[1m Resolving[22m[39m package versions...
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Project.toml`
[90m [no changes][39m
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Manifest.toml`
[90m [no changes][39m


│   caller = import_sklearn() at Skcore.jl:120
└ @ ScikitLearn.Skcore /Users/sandrapietrowska/.julia/packages/ScikitLearn/HK6Vs/src/Skcore.jl:120
│   caller = top-level scope at Skcore.jl:158
└ @ Core /Users/sandrapietrowska/.julia/packages/ScikitLearn/HK6Vs/src/Skcore.jl:158


PyObject <class 'sklearn.preprocessing.label.LabelEncoder'>

### Apply <em>fit_transform!</em> to each category

In [24]:
labelencoder = LabelEncoder() 
categories = [2 3 4 5 6 12 13] 

1×7 Array{Int64,2}:
 2  3  4  5  6  12  13

In [25]:
for col in categories
    train_nona[col] = fit_transform!(labelencoder, train_nona[col])
end

│   caller = #fit_transform!#25(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::PyCall.PyObject, ::Array{String,1}) at Skcore.jl:100
└ @ ScikitLearn.Skcore /Users/sandrapietrowska/.julia/packages/ScikitLearn/HK6Vs/src/Skcore.jl:100
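
To make the transformation less opaque, here is a rough pure-Julia sketch of what a label encoder does: it sorts the distinct labels and replaces each value with the 0-based position of its label, as scikit-learn's encoder does. `label_encode` is a hypothetical helper written for illustration, not part of ScikitLearn.jl:

```julia
# Hypothetical helper mirroring the idea behind a label encoder
function label_encode(values)
    classes = sort(unique(values))   # sorted distinct labels
    # Map each label to a 0-based integer code
    codes = Dict(c => i - 1 for (i, c) in enumerate(classes))
    return [codes[v] for v in values], classes
end

encoded, classes = label_encode(["Male", "Female", "Male"])
println(encoded)
println(classes)
```

Returning `classes` alongside the codes makes it possible to map the integers back to the original labels later.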


### Using the <em>convert</em> function, create label, train and test arrays

In [26]:
predictors = [:Gender, :Married, :Dependents, :Education, :Self_Employed, 
    :Loan_Amount_Term, :Credit_History, :Property_Area, :LoanAmount]

9-element Array{Symbol,1}:
 :Gender          
 :Married         
 :Dependents      
 :Education       
 :Self_Employed   
 :Loan_Amount_Term
 :Credit_History  
 :Property_Area   
 :LoanAmount      

In [27]:
y = convert(Array, train_nona[:Loan_Status]) # label column (column 13)
X = convert(Array, train_nona[predictors])
X2 = convert(Array, test_nona[predictors]) # test_nona was not label-encoded, so X2 still holds strings

│   caller = top-level scope at In[27]:2
└ @ Core In[27]:2
│   caller = top-level scope at In[27]:3
└ @ Core In[27]:3


289×9 Array{Any,2}:
 "Male"    "Yes"  "0"   "Graduate"      "No"   360  1  "Urban"      110
 "Male"    "Yes"  "1"   "Graduate"      "No"   360  1  "Urban"      126
 "Male"    "Yes"  "2"   "Graduate"      "No"   360  1  "Urban"      208
 "Male"    "No"   "0"   "Not Graduate"  "No"   360  1  "Urban"       78
 "Male"    "Yes"  "0"   "Not Graduate"  "Yes"  360  1  "Urban"      152
 "Female"  "No"   "1"   "Not Graduate"  "No"   360  1  "Semiurban"   59
 "Male"    "Yes"  "2"   "Not Graduate"  "No"   360  0  "Rural"      147
 "Male"    "No"   "0"   "Not Graduate"  "No"   360  1  "Semiurban"  123
 "Male"    "No"   "0"   "Not Graduate"  "No"   360  1  "Urban"       90
 "Female"  "No"   "0"   "Graduate"      "No"   360  1  "Semiurban"  124
 "Male"    "No"   "1"   "Graduate"      "No"   360  1  "Urban"      131
 "Male"    "Yes"  "2"   "Graduate"      "No"   360  1  "Urban"      200
 "Male"    "Yes"  "3+"  "Graduate"      "No"   360  1  "Semiurban"  126
 ⋮                                          

## Fit model

In [28]:
@sk_import linear_model: LogisticRegression # the model class must be imported before use
model = LogisticRegression()
fit!(model, X, y)

In [29]:
predictions = predict(model, X)

## Evaluation

### Calculate accuracy

In [30]:
@sk_import metrics: accuracy_score
accuracy = accuracy_score(y, predictions) # arguments are (y_true, y_pred)
println("Accuracy : ", accuracy)
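
Accuracy is simply the fraction of predictions that match the true labels, so it can also be computed directly. A minimal sketch with made-up labels; `y_true` and `y_pred` are illustrative names, not the notebook's variables:

```julia
using Statistics  # provides mean

y_true = [1, 0, 1, 1]   # made-up true labels
y_pred = [1, 0, 0, 1]   # made-up predictions

# Element-wise comparison yields booleans; their mean is the share of matches
accuracy = mean(y_pred .== y_true)
println("Accuracy : ", accuracy)
```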

### Use <em>cross_val_score</em> to do cross validation with 5 folds

In [31]:
using ScikitLearn.CrossValidation: cross_val_score
cross_score = cross_val_score(model, X, y, cv=5)
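
To illustrate what "5 folds" means, the sketch below splits row indices into contiguous folds and shows which rows are held out in each round. It assumes plain, unshuffled k-fold splitting; the library may shuffle or stratify instead. `kfold_indices` is a hypothetical helper, not a ScikitLearn.jl function:

```julia
# Hypothetical helper: split indices 1:n into k contiguous folds of near-equal size
function kfold_indices(n, k)
    folds = Vector{UnitRange{Int}}()
    start = 1
    for i in 1:k
        # The first n % k folds receive one extra element
        len = div(n, k) + (i <= n % k ? 1 : 0)
        push!(folds, start:(start + len - 1))
        start += len
    end
    return folds
end

# Each fold serves once as the test set; the remaining rows form the training set
for (i, test_idx) in enumerate(kfold_indices(10, 3))
    train_idx = setdiff(1:10, test_idx)
    println("fold $i: test = $test_idx, train = $train_idx")
end
```

A model would be fitted on `train_idx` and scored on `test_idx` in each round, giving one score per fold.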