Skip to content

Latest commit

 

History

History
191 lines (146 loc) · 7.4 KB

README.en-US.md

File metadata and controls

191 lines (146 loc) · 7.4 KB

English| 简体中文

Alink

Alink is the Machine Learning algorithm platform based on Flink, developed by the PAI team of Alibaba computing platform. Welcome everyone to join the Alink open source user group to communicate.

List of Algorithms

PyAlink

Quick start

PyAlink Manual

Preparation before use:


About package names and versions:

  • PyAlink provides different Python packages for Flink versions that Alink supports: package pyalink always maintains Alink Python API against the latest Flink version, which is 1.10, while pyalink-flink-*** support old-version Flink, which are pyalink-flink-1.9 for now.
  • The version of python packages always follows Alink Java version, like 1.1.0.

Installation steps:

  1. Make sure the version of python3 on your computer is 3.6 or 3.7.
  2. Make sure Java 8 is installed on your computer.
  3. Use pip to install: pip install pyalink or pip install pyalink-flink-1.9 (Note: for now, pyalink-flink-1.9 is not available,use following links instead).

Potential issues:

  1. pyalink and/or pyalink-flink-*** can not be installed at the same time. Multiple versions are not allowed. If pyalink or pyalink-flink-*** was/were installed, please use pip uninstall pyalink or pip uninstall pyalink-flink-*** to remove them.

  2. If pip install is slow of failed, refer to this article to change the pip source, or use the following download links:

    • Flink 1.10:Link 1 Link 2 (MD5: b0541ea013e0ceae47d6961149d2c46f)
    • Flink 1.9: Link 1 Link 2 (MD5: fca8937ff724734dc3bcd27d12cdc997)
  3. If multiple version of Python exist, you may need to use a special version of pip, like pip3; If Anaconda is used, the command should be run in Anaconda prompt.

Start using:


We recommend using Jupyter Notebook to use PyAlink to provide a better experience.

Steps for usage:

  1. Start Jupyter: jupyter notebook in terminal , and create Python 3 notebook.

  2. Import the pyalink package: from pyalink.alink import *.

  3. Use this command to create a local runtime environment:

    useLocalEnv(parallism, flinkHome=None, config=None).

    Among them, the parameter parallism indicates the degree of parallelism used for execution;flinkHome is the full path of flink,and the default flink-1.9.0 path of PyAlink is used; config is the configuration parameter accepted by Flink. After running, the following output appears, indicating that the initialization of the running environment is successful.

JVM listening on ***
Python listening on ***
  1. Start writing PyAlink code, for example:
source = CsvSourceBatchOp()\
    .setSchemaStr("sepal_length double, sepal_width double, petal_length double, petal_width double, category string")\
    .setFilePath("https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/iris.csv")
res = source.select(["sepal_length", "sepal_width"])
df = res.collectToDataframe()
print(df)

Write code:


In PyAlink, the interface provided by the algorithm component is basically the same as the Java API, that is, an algorithm component is created through the default construction method, then the parameters are set through setXXX, and other components are connected through link / linkTo / linkFrom.

Here, Jupyter's auto-completion mechanism can be used to provide writing convenience.

For batch jobs, you can trigger execution through methods such as print / collectToDataframe / collectToDataframes of batch components or BatchOperator.execute (); for streaming jobs, start the job with StreamOperator.execute ().

More usage:


Java API Manual

KMeans Example

String URL = "https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/iris.csv";
String SCHEMA_STR = "sepal_length double, sepal_width double, petal_length double, petal_width double, category string";

BatchOperator data = new CsvSourceBatchOp()
        .setFilePath(URL)
        .setSchemaStr(SCHEMA_STR);

VectorAssembler va = new VectorAssembler()
        .setSelectedCols(new String[]{"sepal_length", "sepal_width", "petal_length", "petal_width"})
        .setOutputCol("features");

KMeans kMeans = new KMeans().setVectorCol("features").setK(3)
        .setPredictionCol("prediction_result")
        .setPredictionDetailCol("prediction_detail")
        .setReservedCols("category")
        .setMaxIter(100);

Pipeline pipeline = new Pipeline().add(va).add(kMeans);
pipeline.fit(data).transform(data).print();

With Flink-1.10

<dependency>
    <groupId>com.alibaba.alink</groupId>
    <artifactId>alink_core_flink-1.10_2.11</artifactId>
    <version>1.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner_2.11</artifactId>
    <version>1.10.0</version>
</dependency>

With Flink-1.9

<dependency>
    <groupId>com.alibaba.alink</groupId>
    <artifactId>alink_core_flink-1.9_2.11</artifactId>
    <version>1.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner_2.11</artifactId>
    <version>1.9.0</version>
</dependency>

Run Alink Algorithm with a Flink Cluster

  1. Prepare a Flink Cluster:
  wget https://archive.apache.org/dist/flink/flink-1.10.0/flink-1.10.0-bin-scala_2.11.tgz
  tar -xf flink-1.10.0-bin-scala_2.11.tgz && cd flink-1.10.0
  ./bin/start-cluster.sh
  1. Build Alink jar from the source:
  git clone https://github.com/alibaba/Alink.git
  cd Alink && mvn -Dmaven.test.skip=true clean package shade:shade
  1. Run Java examples:
  ./bin/flink run -p 1 -c com.alibaba.alink.ALSExample [path_to_Alink]/examples/target/alink_examples-1.1-SNAPSHOT.jar
  # ./bin/flink run -p 2 -c com.alibaba.alink.GBDTExample [path_to_Alink]/examples/target/alink_examples-1.1-SNAPSHOT.jar
  # ./bin/flink run -p 2 -c com.alibaba.alink.KMeansExample [path_to_Alink]/examples/target/alink_examples-1.1-SNAPSHOT.jar