Install Cassandra in Ambari:

- Make sure your Python version is 2.7.x.
- Log in to the HDP sandbox and run the following:
```
cd /etc/yum.repos.d
vi datastax.repo
```
Then add the following to datastax.repo:

```
[datastax]
name = Datastax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck=0
```
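One way to catch typos in the repo definition before running yum is to parse it. The sketch below is only illustrative. It assumes a Python 3 interpreter (on Python 2.7 the module is named `ConfigParser` and has no `read_string`), and `check_repo` and `REPO_TEXT` are hypothetical names introduced here, not part of any tool.

```python
import configparser

# Inline copy of the repo definition so the sketch is self-contained.
# On the sandbox the same text lives in /etc/yum.repos.d/datastax.repo.
REPO_TEXT = """\
[datastax]
name = Datastax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck=0
"""


def check_repo(text, section="datastax"):
    """Parse repo text and return the options of one section as a dict."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    if not parser.has_section(section):
        raise ValueError("missing [{0}] section".format(section))
    return dict(parser.items(section))


repo = check_repo(REPO_TEXT)
print(repo["baseurl"])
```

To check the real file instead of the inline text, `parser.read(path)` could replace `parser.read_string(text)`.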
Then install Cassandra and the Python CQL client, and start the service:
```
yum install dsc30
pip install cql
service cassandra start
```
Now that the Cassandra service is running, start cqlsh and create a keyspace (the Cassandra counterpart of a database in SQL):

```
cqlsh
CREATE KEYSPACE movielens WITH replication = {'class':'SimpleStrategy','replication_factor':'1'} AND durable_writes = true;
```
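Note that the single quotes are required. CQL uses single quotes for string literals, while double quotes denote case-sensitive identifiers, so do not use double quotes here.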
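If the statement is ever generated from a script rather than typed by hand, the quoting rule can be kept in one place. This is a minimal sketch, assuming a hypothetical helper named `create_keyspace_cql`; it only builds the string and does not talk to Cassandra.

```python
def create_keyspace_cql(keyspace, replication_factor=1):
    """Build a CREATE KEYSPACE statement (hypothetical helper).

    The doubled braces are literal braces for str.format; the CQL
    string literals inside the map use single quotes throughout.
    """
    return (
        "CREATE KEYSPACE {name} WITH replication = "
        "{{'class':'SimpleStrategy','replication_factor':'{rf}'}} "
        "AND durable_writes = true;"
    ).format(name=keyspace, rf=replication_factor)


statement = create_keyspace_cql("movielens")
print(statement)
```

The printed statement can be pasted into cqlsh as-is.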

Now that the movielens keyspace exists, type "USE movielens;" to switch to it, then create a table in it:
```
CREATE TABLE users (user_id int, age int, gender text, occupation text, zip text, PRIMARY KEY (user_id));
DESCRIBE TABLE users;
```


Use Spark to write data into the Cassandra table

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions

def parseInput(line):
    fields = line.split('|')
    return Row(user_id = int(fields[0]), age = int(fields[1]), gender = fields[2], occupation = fields[3], zip = fields[4]) # field names must match the Cassandra column names

if __name__ == "__main__":
    # Create a SparkSession
    spark = SparkSession.builder.appName("CassandraIntegration").config("spark.cassandra.connection.host", "127.0.0.1").getOrCreate()

    # Get the raw data
    # (u.user is the pipe-delimited user file; u.data holds tab-separated ratings)
    lines = spark.sparkContext.textFile("hdfs://127.0.0.1:8020/user/maria_dev/ml-100k/u.user")
    # Convert it to an RDD of Row objects with (user_id, age, gender, occupation, zip)
    users = lines.map(parseInput)
    # Convert that to a DataFrame
    usersDataset = spark.createDataFrame(users)

    # Write it into Cassandra
    usersDataset.write\
        .format("org.apache.spark.sql.cassandra")\
        .mode('append')\
        .options(table="users", keyspace="movielens")\
        .save()

    # Read it back from Cassandra into a new DataFrame
    readUsers = spark.read\
        .format("org.apache.spark.sql.cassandra")\
        .options(table="users", keyspace="movielens")\
        .load()

    readUsers.createOrReplaceTempView("users")

    sqlDF = spark.sql("SELECT * FROM users WHERE age < 20")
    sqlDF.show()

    # Stop the session
    spark.stop()
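
Before submitting the job, the parsing step can be checked on its own, without Spark. The sketch below is an assumption-laden stand-in: `UserRow` is a plain namedtuple used in place of `pyspark.sql.Row`, `parse_user_line` mirrors `parseInput`, and `SAMPLE_LINE` is an illustrative pipe-delimited record in the layout the script expects.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark installed.
# Field names match the columns of the Cassandra users table.
UserRow = namedtuple("UserRow", ["user_id", "age", "gender", "occupation", "zip"])


def parse_user_line(line):
    """Split one pipe-delimited user record into typed fields."""
    fields = line.split('|')
    return UserRow(
        user_id=int(fields[0]),  # int column in Cassandra
        age=int(fields[1]),      # int column in Cassandra
        gender=fields[2],
        occupation=fields[3],
        zip=fields[4],           # kept as text, matching "zip text"
    )


SAMPLE_LINE = "1|24|M|technician|85711"  # illustrative record
row = parse_user_line(SAMPLE_LINE)
print(row)
```

Feeding it a tab-separated line instead raises a `ValueError` from `int()`, which is a quick signal that the wrong input file is being read.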


Save the script as CassandraSpark.py and run it from the shell:
```
spark-submit --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 CassandraSpark.py
```
To double-check, open cqlsh and confirm that the data was stored in the Cassandra table:
```
cqlsh
USE movielens;
SELECT * FROM users LIMIT 20;
```